CN102054107B - Lower hardware mapping method of integrated circuit, and space-time diagram generation method and device - Google Patents

Publication number
CN102054107B
Authority
CN
China
Prior art keywords
operator
data
integrated circuit
control flow
spacetime diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010619832
Other languages
Chinese (zh)
Other versions
CN102054107A (en
Inventor
王新安
胡子一
安辉耀
谢峥
王腾
张兴
周生明
赵秋奇
马芝
孙亚春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN 201010619832 priority Critical patent/CN102054107B/en
Publication of CN102054107A publication Critical patent/CN102054107A/en
Application granted granted Critical
Publication of CN102054107B publication Critical patent/CN102054107B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

The invention discloses a lower hardware mapping method and device of an integrated circuit, wherein a computer language program describing an integrated circuit algorithm is analyzed, mapped to a data control flow diagram and converted into an operator time-space diagram; the data control flow diagram is subject to time-sequence constraint; the operator time-space diagram is subject to cluster compression according to a time sequence label; and logic description of a lower hardware circuit of the integrated circuit is generated so as to create a mapping tool from the computer language to the lower hardware circuit of the integrated circuit; and thus, a process of generating the lower hardware of the integrated circuit from languages such as C, MATLAB and the like is realized in a standard manner, and the realization is convenient and quick. In the operator time-space diagram generation method and device disclosed by the invention, the data is expanded according to the data dependency of a data flow in a data control flow, and an operator is scheduled to convert the data control flow diagram into the operator time-space diagram; and according to the circuit obtained by the method, the layout regularity is improved, and optimal design of low energy consumption can be realized.

Description

Lower hardware mapping method of integrated circuit, and spacetime diagram generation method and device
Technical field
The present invention relates to the field of integrated circuit (IC) design, and in particular to a lower hardware mapping method for integrated circuits and to a spacetime diagram generation method and device.
Background art
In the integrated circuit field, design speed usually lags behind the pace at which fabrication processes advance. Since fabrication processes entered the nanometer scale in particular, design speed has fallen far behind. Improving design speed is therefore one of the most pressing problems in integrated circuit design. As shown in Fig. 1, in the prior art the design of an integrated circuit generally comprises two parts: the first goes from an algorithm description in C or MATLAB to a register-transfer-level (RTL) description; the second goes from the RTL description to an implementation on a standard-cell ASIC structure, a gate array (or other S-ASIC structure), or an FPGA structure. The second part is already supported by fairly mature tools, and its implementation process largely meets requirements such as efficiency and speed. The key to improving design speed therefore lies in the first part, that is, in going from an algorithm description in C, MATLAB or a similar language to an RTL description. This can be called lower hardware mapping of the integrated circuit, high-level synthesis, or architecture-level synthesis.
However, the first part is mainly carried out by engineers, who manually convert the C or MATLAB description into an RTL description based on their own understanding. The first part therefore depends on each engineer's experience and knowledge, and the time it takes varies considerably from one engineer to another. For the first part, several companies and institutions outside China have carried out research and released implementation tools, such as Mentor's Catapult C, AutoESL's AutoPilot, Forte Design Systems' Cynthesizer, and SPARK from UC San Diego.
Summary of the invention
The main technical problem to be solved by the present invention is to provide a lower hardware mapping method and device for integrated circuits that can improve the design speed of integrated circuits.
The present invention also provides a spacetime diagram generation method and device, such that a circuit obtained by this method has a more regular layout in terms of area and an optimized low-power design in terms of power consumption.
To solve the above technical problem, the present invention adopts the following technical solution:
A lower hardware mapping method of an integrated circuit, comprising the steps of:
A program analysis step, for reading a computer language program that describes an integrated circuit algorithm and identifying from it the execution objects and parameter objects to be mapped;
A data control flow graph generation step, for mapping the identified execution objects and parameter objects to the corresponding nodes of a data control flow graph that describes the integrated circuit algorithm;
An operator spacetime diagram generation step, for obtaining, according to the function performed by each node of the data control flow graph, at least one operator unit with the corresponding function from a pre-established operator cell library, and converting the data control flow graph into an operator spacetime diagram composed of operator units;
A temporal constraint step, for determining a total temporal constraint according to the user's specification requirements and the requirements of the target integrated circuit process, labeling each operator unit in the operator spacetime diagram with a time, and applying temporal constraints to each level of the operator spacetime diagram;
A spacetime diagram compression step, for performing spatial cluster compression on the operator spacetime diagram according to the time labels, so that the overall algorithm execution time approaches the total temporal constraint;
A lower hardware mapping step, for generating a logic description of the integrated circuit lower hardware from the cluster-compressed operator spacetime diagram.
Based on the above method, the present invention also provides an integrated circuit lower hardware mapping device, comprising:
A program analysis module, for reading a computer language program that describes an integrated circuit algorithm and identifying from it the execution objects and parameter objects to be mapped;
A data control flow graph generation module, for mapping the identified execution objects and parameter objects to the corresponding nodes of a data control flow graph that describes the integrated circuit algorithm;
An operator spacetime diagram generation module, for taking, according to the function performed by each node of the data control flow graph, at least one operator unit with the corresponding function from a pre-established operator cell library, and converting the data control flow graph into an operator spacetime diagram composed of operator units;
A temporal constraint module, for determining a total temporal constraint according to the user's specification requirements and the requirements of the target integrated circuit process, labeling each operator unit in the operator spacetime diagram with a time, and applying temporal constraints to each level of the operator spacetime diagram;
A spacetime diagram compression module, for performing spatial cluster compression on the spacetime diagram according to the time labels, so that the overall algorithm execution time approaches the total temporal constraint;
A lower hardware mapping module, for generating a logic description of the integrated circuit lower hardware from the cluster-compressed spacetime diagram.
The present invention also provides a spacetime diagram generation method, comprising the steps of:
Unfolding the data control flow graph according to its data dependencies;
Taking, according to the function performed by each node after unfolding, at least one operator unit with the corresponding function from a pre-established operator cell library, and converting the data control flow graph into an operator spacetime diagram composed of operator units.
Further, when the data flow in the data control flow graph has a sequentially dependent structure, the sequentially dependent data flow is unfolded as a pipeline and converted into an operator spacetime diagram composed of operator units.
Further, when the data flow in the data control flow graph contains feedback and there is no data dependency among the data inside the data flow, each data flow whose internal data has no data dependency is unfolded as a local pipeline and converted into an operator spacetime diagram composed of operator units.
Further, when there is no data dependency between the data flows in the data control flow graph, the data flows are unfolded in parallel and converted into an operator spacetime diagram composed of operator units.
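As a rough illustration of what the unfolding modes above produce, the sketch below places work items on a grid whose two axes are time step and operator instance. The function names, the assumption that every stage takes exactly one time step, and the `stage#i` naming of replicated operators are hypothetical and are not taken from the patent.

```python
def unfold_pipeline(stages, n_items):
    """Pipeline unfolding: item i occupies stage s at time step i + s."""
    return {(i + s, stage): i
            for i in range(n_items)
            for s, stage in enumerate(stages)}


def unfold_parallel(stages, n_items):
    """Parallel unfolding: every item gets its own copy of each stage."""
    return {(s, f"{stage}#{i}"): i
            for i in range(n_items)
            for s, stage in enumerate(stages)}


def latency(diagram):
    """Number of time steps spanned by a spacetime diagram."""
    return max(t for t, _ in diagram) + 1


def operator_count(diagram):
    """Number of distinct operator instances (the 'space' axis)."""
    return len({op for _, op in diagram})


stages = ["load", "sub", "acc"]          # hypothetical three-stage data flow
pipe = unfold_pipeline(stages, 4)
par = unfold_parallel(stages, 4)
print(latency(pipe), operator_count(pipe))
print(latency(par), operator_count(par))
```

Under these assumptions the pipeline form uses fewer operator instances but more time steps, while the parallel form does the opposite, which is the trade-off the later temporal constraint and compression steps work with.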
A spacetime diagram generating device, comprising:
An operator cell library, for storing operator units that can implement operation functions;
An unfolding unit, for unfolding the data control flow graph according to its data dependencies;
An operator spacetime diagram generation module, for taking, according to the function performed by each node after unfolding, at least one operator unit with the corresponding function from the operator cell library, and converting each function module in the data control flow graph into an operator spacetime diagram composed of operator units.
Further, the device also comprises a judging unit for determining the data flow dependency type, and the unfolding unit operates according to the data flow dependency type.
Further, when the judging unit determines that the data flow in the data control flow graph has a sequentially dependent structure, the unfolding unit unfolds the sequentially dependent data flow as a pipeline; when the judging unit determines that the data flow in the data control flow graph contains feedback, the unfolding unit determines whether there is a data dependency among the data inside the data flow with feedback, and if there is none, unfolds each data flow whose internal data has no data dependency as a local pipeline; when the judging unit determines that there is no data dependency between the data flows in the data control flow graph, the unfolding unit unfolds the data flows in parallel.
Further, the operator spacetime diagram generation module first converts the data control flow graph of a function into an operator spacetime diagram, and then converts the data control flow graph of each sub-function called by that function into an operator spacetime diagram.
The beneficial effects of the present invention are as follows. The original C or MATLAB program is analyzed to identify the execution objects and parameter objects to be mapped; the identified execution objects and parameter objects are mapped to a data control flow graph that represents the integrated circuit algorithm; the data control flow graph is then unfolded according to its data dependencies, and the unfolded nodes are replaced by operators to generate an operator spacetime diagram; the operator spacetime diagram undergoes cluster compression so that the overall execution time of the compressed spacetime diagram approaches the total temporal constraint; and the lower hardware circuit of the integrated circuit is generated from the compressed spacetime diagram. This creates a mapping tool from a computer language to the lower hardware circuit of an integrated circuit, standardizes the process of generating integrated circuit lower hardware from languages such as C or MATLAB, and makes implementation convenient and fast.
The present invention also generates an operator-based spacetime diagram. Because operators can form operator function blocks, which in turn form operator function groups, a circuit obtained by this method has a more regular layout in terms of area; in terms of power consumption, control operators can configure the operating mode and clock frequency of the operators, giving an optimized low-power design.
Description of drawings
Fig. 1 is a flowchart of an integrated circuit design method in the prior art;
Fig. 2 is a schematic structural diagram of the ADDS operator;
Fig. 3 is a function diagram of the ADDS operator;
Fig. 4 is a schematic diagram of the general structure of a storage-class operator;
Fig. 5 is a schematic diagram of the general structure of a path-class operator;
Fig. 6 is a schematic diagram of the general structure of a control-class operator;
Fig. 7 is a flowchart of the lower hardware mapping of the present invention;
Fig. 8 is a flowchart of analyzing the functions x264_me_search and pixel_sad_16×16 in an embodiment of the data control flow graph generation method of the present invention;
Fig. 9 is a flowchart of generating the data control flow graph from the results of the analysis steps in Fig. 8;
Fig. 10 is a mapping structure diagram of a single-layer dynamic loop statement in an embodiment of the present invention;
Fig. 11 is one mapping structure diagram of a multi-layer dynamic loop statement in an embodiment of the present invention;
Fig. 12 is another mapping structure diagram of a multi-layer dynamic loop statement in an embodiment of the method of the present invention;
Fig. 13 is a structure diagram of the mapping of a branch control statement in an embodiment of the present invention;
Fig. 14 is a structure diagram of the mapping of a nested branch control statement in an embodiment of the present invention;
Fig. 15 is the data control flow graph of the function x264_me_search of this embodiment;
Fig. 16 is the data control flow graph of the function pixel_sad_16×16 of this embodiment;
Fig. 17 is a structure diagram of an embodiment of unfolding a sequentially dependent data flow according to the present invention;
Fig. 18 is a schematic diagram of a data flow with feedback unfolded by operators into a local pipeline structure according to the present invention;
Fig. 19 is a structure diagram of an embodiment of unfolding parallel data flows according to the present invention;
Fig. 20 is the operator spacetime diagram generated from logic L0 and L1 of the function x264_me_search of this embodiment;
Fig. 21a and Fig. 21b are the operator spacetime diagrams generated from the first part and the second part, respectively, of logic L3 of the function x264_me_search;
Fig. 22 is the operator spacetime diagram generated from logic L5 of the function x264_me_search of this embodiment;
Fig. 23a and Fig. 23b are the operator spacetime diagrams of the function x264_me_search and the function pixel_sad_16×16 of this embodiment, respectively;
Fig. 24 is a structural block diagram of an embodiment of the spacetime diagram generating device of the present invention;
Fig. 25 is a structural block diagram of an embodiment of the spacetime diagram generation module of Fig. 24;
Fig. 26a and Fig. 26b are schematic diagrams of the temporal constraints of the function x264_me_search and the function pixel_sad_16×16 of this embodiment, respectively;
Fig. 27 is the spacetime diagram of the function pixel_sad_16×16 of this embodiment after cluster compression;
Fig. 28 is a schematic diagram of an embodiment of solidified customization;
Fig. 29 is a comparison diagram of x264_me_search after cluster compression.
Embodiment
The present invention is described in further detail below through embodiments and with reference to the accompanying drawings.
Looking back at the development of integrated circuit design methodology: when fabrication processes entered the 1 µm era, design methods with the gate array as the basic unit appeared; when they entered the 0.5 µm era, design methods with the standard cell as the basic unit appeared; and when they entered the 0.18 µm era, design methods with the IP core as the basic unit appeared. This shows that, on the one hand, design methodology advances along with fabrication processes and, on the other hand, the granularity of the basic unit used in design methodology (gate, standard cell, IP core) keeps increasing. Moreover, the appearance of each new basic unit has marked a revolutionary advance in design methodology. It can therefore reasonably be predicted that, with the rapid progress of fabrication processes over the past decade, and especially since they entered the nanometer scale, a basic unit of coarser granularity will emerge and open a new chapter in integrated circuit design, in order to keep pace with the rapid development of fabrication processes.
As a basic unit among integrated circuit building blocks, the operator has a granularity larger than that of a standard cell. The present invention therefore adopts an operator-based integrated circuit design method, which speeds up integrated circuit design so that it keeps pace with the progress of fabrication processes.
In the present invention, the commonly used operators fall into five classes: computing-class operators, storage-class operators, path-class operators, control-class operators and clock-class operators.
1. Computing-class operators.
A computing operator (AU) is the basic unit for implementing logical operations, arithmetic operations, or mixed logical and arithmetic operations. It comprises an arithmetic logic unit and a computing configuration register. The configuration register receives and stores computing configuration instructions, and different configuration instructions correspond to different arithmetic-logic operations; that is, the same computing operator can be made to perform several different functions through configuration instructions. The computing operator is described below using the ADDS operator as an example.
Fig. 2 is a schematic structural diagram of the ADDS operator. It comprises an ADD unit that implements addition and subtraction and a "/" unit that implements shift operations. By setting the value of control bit X, the ADDS operator can be made to perform several different functions; for example, the table in Fig. 3 shows, for one embodiment, the correspondence between values of control bit X and operations. An operator such as ADDS that can perform several different functions under control bit X is called a reconfigurable operator. Because a reconfigurable operator offers a rich set of functions and can be used in different scenarios, it reduces the number of operators that must be stored in the operator cell library. A reconfigurable operator can also be reconfigured dynamically by changing the control bit during execution.
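The idea of a reconfigurable operator driven by a control value can be sketched as follows. This is a minimal model only: the control-bit encoding (`ADD`, `SUB`, `ADD_SHL`, `ADD_SHR`) and the shift-amount field are hypothetical, since the actual correspondence is given in Fig. 3, which is not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical encoding of control bit X; the real table is in Fig. 3.
ADD, SUB, ADD_SHL, ADD_SHR = 0, 1, 2, 3


@dataclass
class AddsOperator:
    """Toy ADDS operator: an add/subtract unit followed by a shift unit."""
    config: int = ADD   # models the computing configuration register
    shift: int = 1      # assumed shift amount, not specified in the source

    def configure(self, x, shift=1):
        """Store a new configuration, i.e. reconfigure the operator."""
        self.config = x
        self.shift = shift

    def execute(self, a, b):
        if self.config == ADD:
            return a + b
        if self.config == SUB:
            return a - b
        if self.config == ADD_SHL:
            return (a + b) << self.shift
        if self.config == ADD_SHR:
            return (a + b) >> self.shift
        raise ValueError(f"unknown control value {self.config}")


op = AddsOperator()
print(op.execute(3, 4))      # configured as plain addition
op.configure(ADD_SHR, 1)     # dynamic reconfiguration between executions
print(op.execute(3, 5))      # add, then shift right by one
```

One object serving several functions is what lets the operator cell library stay small, as the paragraph above notes.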
2. Storage-class operators.
Fig. 4 is a schematic diagram of the basic structure of a storage-class operator (MU); in the figure, CU denotes the control operator. The storage operator comprises a storage configuration register (MU configuration register) and a storage unit, and the storage unit comprises an address generation unit, a data memory, a data generation unit and a data output control unit. Through the data output control unit, the storage configuration register can configure the write and/or read mode of the storage operator's memory bank (MEM, which may be any of various storage media such as registers or RAM); it can also configure the working mode of the address generation unit associated with the memory. Input data is stored directly at the location given by the address that the address generation unit produces, and the required data is output from its storage location.
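A storage operator whose address generation is set through a configuration register might be modeled as below. The base-plus-stride addressing scheme and all names are assumptions made for illustration; the source only states that the address generation unit's working mode is configurable.

```python
class StorageOperator:
    """Toy storage operator (MU): a memory bank plus a configurable address generator."""

    def __init__(self, size):
        self.mem = [0] * size   # data memory (MEM)
        self.base = 0           # assumed fields of the MU configuration register
        self.stride = 1
        self._count = 0         # accesses since the last configure()

    def configure(self, base, stride):
        """Write the configuration register and restart address generation."""
        self.base, self.stride, self._count = base, stride, 0

    def _next_address(self):
        """Address generation unit: base + n * stride, wrapped to the memory size."""
        addr = (self.base + self._count * self.stride) % len(self.mem)
        self._count += 1
        return addr

    def write(self, value):
        self.mem[self._next_address()] = value

    def read(self):
        return self.mem[self._next_address()]


mu = StorageOperator(8)
mu.configure(base=0, stride=2)
for v in (10, 20, 30):
    mu.write(v)                  # stored at addresses 0, 2 and 4
mu.configure(base=0, stride=2)   # same pattern, now used for reading
print([mu.read() for _ in range(3)])
```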
3. Path-class operators.
Fig. 5 is a schematic diagram of the general structure of a path-class operator (LU). The path operator LU comprises a routing configuration register (LU configuration register), together with a crossbar switch and data registers (REG) that form the route selection unit. The routing configuration register is controlled by the control operator CU and, under that control, drives the crossbar switch to connect the different computing operators AU in the desired way. The data registers temporarily hold the input and output data of the computing-class operators AU and the storage-class operators MU.
4. Control-class operators.
Fig. 6 is a schematic diagram of the general structure of a control-class operator (CU). Its main role is to send configuration information to the corresponding configuration registers, configuring the computing operators AU, storage operators MU and path operators LU to perform their predetermined functions. A control operator CU takes one of three forms: counter, state machine or microinstruction. The microinstruction structure comprises a decoder, a program counter, an instruction memory, a pipeline control module and so on. The control operator CU sends configuration information to each functional unit by executing simple configuration instructions; because the CU supports very few instructions, the instruction register capacity is small and the decoder is very simple.
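The microinstruction form of the control operator can be pictured as a program counter stepping through a small instruction memory, where each instruction writes one configuration register. The single `(register, value)` instruction format and the register names `AU0`, `MU0` and `LU0` are hypothetical simplifications, not the patent's instruction set.

```python
class ControlOperator:
    """Toy microinstruction-style control operator (CU)."""

    def __init__(self, program):
        self.program = list(program)  # instruction memory: (register name, value)
        self.pc = 0                   # program counter

    def step(self, registers):
        """Execute one configuration instruction; return False when finished."""
        if self.pc >= len(self.program):
            return False
        target, value = self.program[self.pc]
        registers[target] = value     # send configuration to the target register
        self.pc += 1
        return True


config_registers = {}
cu = ControlOperator([("AU0", 1), ("MU0", 2), ("LU0", 0)])
while cu.step(config_registers):
    pass
print(config_registers)
```

With only one kind of instruction, decoding is trivial, which mirrors the remark above that the CU's decoder can be very simple.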
5. Clock-class operators.
A clock operator provides the clock control signals for the computing-class, storage-class, path-class and control-class operators. The clock signals include signals that start and stop the clock and signals that control the clock frequency, and they can be configured in the desired way.
The above five classes of operators are the basis for the embodiments below. It should be understood that dividing the operators used in integrated circuit design into five classes by function is not the only possible division; depending on the actual situation, the division may be made broader or finer.
An embodiment of the present invention provides a mapping system from a computer language to the lower hardware circuit of an integrated circuit. Fig. 7a shows the lower hardware mapping method of this system, which comprises the following steps:
Step S1, program analysis: read the computer language program that describes the integrated circuit algorithm, and identify from it, according to the rules of that computer language, the execution objects and parameter objects to be mapped. An application-specific integrated circuit is used to implement a specific protocol or function, and these functions and protocols are usually first described in a computer language program, typically written in C or MATLAB. The written program is input into the mapping system of the present invention, which then identifies the execution objects and parameter objects to be mapped according to the rules of the language in which the program is written.
In this embodiment, the execution objects comprise operation instructions and/or control instructions, and the parameter objects comprise at least one of input data, output data and intermediate data. The operation instructions in this embodiment include addition, subtraction, multiplication, shift and similar operations.
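The split into execution objects and parameter objects can be illustrated with a toy scanner over a single C-like statement. This is only a sketch: a real front end would use a full C or MATLAB parser, and the token pattern, the keyword set and the function name `analyze` are assumptions made for this example.

```python
import re

# Operation instructions named in the text: add, subtract, multiply, shift.
OPERATIONS = {"+": "add", "-": "sub", "*": "mul", "<<": "shl", ">>": "shr"}
# A few C control keywords, standing in for control instructions.
CONTROLS = {"if", "else", "for", "while"}
# Shift operators must be tried before single-character operators.
TOKEN = re.compile(r"<<|>>|[A-Za-z_]\w*|\d+|[-+*]")


def analyze(source):
    """Return (execution objects, parameter objects) found in a statement."""
    execution, parameters = [], []
    for tok in TOKEN.findall(source):
        if tok in OPERATIONS:
            execution.append(("operation", OPERATIONS[tok]))
        elif tok in CONTROLS:
            execution.append(("control", tok))
        elif not tok.isdigit() and tok not in parameters:
            parameters.append(tok)   # variable names; literals are skipped
    return execution, parameters


execution, parameters = analyze("if (a) y = (a + b) << 2;")
print(execution)
print(parameters)
```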
Step S2, data control flow graph generation: map the identified execution objects and parameter objects to the corresponding nodes of the data control flow graph that describes the integrated circuit algorithm. Operation instructions are mapped to processing blocks; control instructions are mapped to identifying state controls, state transition conditions and state control signal flows; and input data, output data and intermediate data are mapped to storage nodes on the data flow.
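The mapping rule of step S2 can be sketched as a lookup from object category to node kind. The `Node` class, the category strings and the edge-list representation are hypothetical; the source specifies only which kind of object becomes which kind of node.

```python
from dataclasses import dataclass

# Object category -> node kind, following the rule stated in step S2.
NODE_KIND = {
    "operation": "process",      # operation instruction -> processing block
    "control": "control",        # control instruction -> state control
    "input": "storage",          # data objects -> storage nodes on the data flow
    "output": "storage",
    "intermediate": "storage",
}


@dataclass(frozen=True)
class Node:
    name: str
    kind: str


def build_dcfg(objects, edges):
    """Build (nodes, edges) from identified objects and their data/control links."""
    nodes = {name: Node(name, NODE_KIND[category]) for name, category in objects}
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            raise KeyError(f"edge ({src}, {dst}) references an unknown node")
    return nodes, list(edges)


objects = [("a", "input"), ("b", "input"), ("add0", "operation"), ("y", "output")]
edges = [("a", "add0"), ("b", "add0"), ("add0", "y")]
nodes, links = build_dcfg(objects, edges)
print(nodes["add0"].kind, nodes["y"].kind)
```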
Step S3, operator spacetime diagram generation: according to the processing function performed by each node of the data control flow graph, take at least one operator unit with the corresponding function from the pre-established operator cell library, and convert the data control flow graph into an operator spacetime diagram composed of operator units. The data control flow graph is first unfolded according to its data flow dependencies, and each node after unfolding is then converted into operator units that can perform that node's function. Each node of the data control flow graph is replaced by a combination of one or more operator units that performs the same function as the node. Ways of unfolding the data control flow graph include, but are not limited to, the following. If the data flow in the data control flow graph has a sequentially dependent structure, the sequentially dependent data flow is unfolded as a pipeline. If the data flow in the data control flow graph contains feedback, that is, the data flow is a loop, the data flow carries a data dependency and cannot be converted into a pipeline structure; but if there is no data dependency among the data inside the data flow, each data flow whose internal data has no data dependency is unfolded as a local pipeline. If there is no data dependency between the data flows in the data control flow graph, the data flows are unfolded in parallel and converted into an operator spacetime diagram composed of operator units.
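The case analysis in step S3 can be condensed into a small decision function plus a library lookup. Both are sketches under assumptions: the `"serial"` fallback for a feedback loop with internal dependencies is not named in the source, and the library contents other than ADDS (for example `MUL`) are hypothetical.

```python
def choose_unfolding(has_feedback, internal_dependency=False, independent_streams=False):
    """Pick an unfolding mode from the data flow's dependency type."""
    if has_feedback:
        # A loop cannot become a full pipeline; without internal data
        # dependencies it can still be unfolded as a local pipeline.
        return "serial" if internal_dependency else "local_pipeline"
    if independent_streams:
        return "parallel"
    return "pipeline"   # sequentially dependent data flow


# Hypothetical operator cell library: node function -> operator units.
OPERATOR_LIBRARY = {
    "add": ["ADDS"],
    "sub": ["ADDS"],
    "mac": ["MUL", "ADDS"],   # one node may need a combination of operators
}


def map_node(function):
    """Replace a node by the operator units that perform its function."""
    return list(OPERATOR_LIBRARY[function])


print(choose_unfolding(has_feedback=False))
print(choose_unfolding(has_feedback=True))
print(map_node("mac"))
```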
Step S4, temporal constraint: determine the total temporal constraint according to the user's specification requirements and the requirements of the target integrated circuit process, and label each operator unit in the operator spacetime diagram with a time. In addition, operator timing information can be extracted from the operator cell library and used to annotate the operator spacetime diagram with timing, forming the objects of the temporal constraint. The temporal constraint can then be broken down, according to the data flow characteristics, to each level of the operator spacetime diagram, so that every level is constrained. Because operators can form different operator function blocks, which in turn form different operator function groups, each operator function group is one operator level.
If the data flow structure is a parallel data flow, the total temporal constraint is divided equally among the operator levels of the corresponding spacetime diagram, and the temporal constraint of each operator level is divided equally among the operator units in that level. For nodes that are serial in the data control flow, the total number of basic timing units of the operators in the corresponding operator levels is taken as the overall temporal constraint, and the timing of each operator level is allocated in proportion to the timing of the computing operators mapped from the longest operation path in that level, relative to the sum of the timing of the operator units mapped from the longest operation paths of all levels.
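One reading of the two allocation rules above is sketched below: an equal split for parallel data flows and a split proportional to longest-path timing for serial nodes. The function names, the level names `G0` to `G2` and the numeric values are hypothetical, and the proportional rule reflects an interpretation of the translated text rather than a confirmed formula.

```python
def split_parallel(total, units_per_level):
    """Equal split: total -> levels, then each level's share -> its operator units.

    Returns {level: (level_budget, per_unit_budget)}.
    """
    per_level = total / len(units_per_level)
    return {level: (per_level, per_level / n)
            for level, n in units_per_level.items()}


def split_serial(total, longest_path_time):
    """Proportional split: each level's share follows its longest-path timing."""
    path_sum = sum(longest_path_time.values())
    return {level: total * t / path_sum
            for level, t in longest_path_time.items()}


print(split_parallel(12, {"G0": 2, "G1": 3}))
print(split_serial(100, {"G0": 2, "G1": 3, "G2": 5}))
```

In the serial case a level with a longer critical path receives a larger share of the total budget, so no level is given less time than its own operators need relative to the others.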
Step S5, spacetime diagram compression: perform cluster compression on the spacetime diagram in space (that is, in hardware resources or area) according to the time labels, so that the overall algorithm execution time approaches the total temporal constraint.
In one embodiment, compressing the spacetime diagram comprises the following steps: find the computing-class operators with identical operational attributes and/or the storage-class operators with identical storage attributes in the operator spacetime diagram; then, according to the time labels, spatially merge and compress the computing-class operators with identical operational attributes and/or the storage-class operators with identical storage attributes; then introduce control-class operators that generate the corresponding configuration instructions for the compressed computing-class and/or storage-class operators, so that the computing-class and/or storage-class operators are reused.
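The merging step can be approximated with a greedy interval-sharing pass: operators with the same attribute whose time labels do not overlap are folded onto one physical unit. This is a swapped-in, simplified technique for illustration; the patent does not state which merging algorithm it uses, and the tuple layout `(name, attribute, start, end)` is an assumption. Generation of the configuration instructions by control operators is not modeled.

```python
from collections import defaultdict


def cluster_compress(ops):
    """Merge same-attribute operators that are never active at the same time.

    ops: list of (name, attribute, start, end) with half-open time labels.
    Returns a list of (attribute, [names sharing one physical unit]).
    """
    by_attr = defaultdict(list)
    for op in ops:
        by_attr[op[1]].append(op)

    units = []
    for attr, group in by_attr.items():
        group.sort(key=lambda o: o[2])          # process in order of start time
        attr_units = []                         # each entry: [busy_until, names]
        for name, _, start, end in group:
            for unit in attr_units:
                if unit[0] <= start:            # unit is free again: reuse it
                    unit[0] = end
                    unit[1].append(name)
                    break
            else:
                attr_units.append([end, [name]])
        units.extend((attr, names) for _, names in attr_units)
    return units


ops = [("a0", "add", 0, 1), ("a1", "add", 1, 2),
       ("a2", "add", 0, 1), ("m0", "mul", 0, 2)]
print(cluster_compress(ops))
```

Here four logical operators fold onto three physical units, because `a0` and `a1` are active in different time steps and can share one adder.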
Both the cluster compression step and the step of generating reconfigurable operator function blocks can produce more than one result. When different functions call the same sub-function, the constraint times differ and so do the resulting clusterings. The results therefore need to be optimized with respect to parameters such as time, area and power consumption. When the results are ordered by performance (execution time), the clustering that only just satisfies the time constraint has the lowest hardware implementation cost; the spacetime diagram whose overall algorithm execution time is closest to the total temporal constraint required to complete the integrated circuit algorithm is therefore selected as the optimized result of cluster compression.
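The selection rule described above can be sketched as: discard candidates that exceed the total temporal constraint, then keep the one whose execution time is closest to it. The candidate records and the tie-break on smaller area are assumptions added for illustration.

```python
def select_clustering(candidates, total_constraint):
    """Pick the feasible clustering whose execution time is closest to the constraint.

    candidates: list of dicts with 'name', 'time' and 'area' keys (hypothetical).
    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c["time"] <= total_constraint]
    if not feasible:
        return None
    # Largest feasible time is closest to the constraint; ties go to smaller area.
    return max(feasible, key=lambda c: (c["time"], -c["area"]))


candidates = [
    {"name": "A", "time": 60, "area": 9},
    {"name": "B", "time": 95, "area": 4},
    {"name": "C", "time": 120, "area": 2},   # too slow for a constraint of 100
]
best = select_clustering(candidates, total_constraint=100)
print(best["name"])
```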
Step S6, the lower hardware mapping step generates integrated circuit lower hardware logical description according to the spacetime diagram after the cluster compression.
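The merge-and-reuse idea of step S5 can be illustrated with a small C model. This is only a sketch under stated assumptions: the function names `add_parallel` and `add_shared` are hypothetical, and the loop variable `sel` stands in for the configuration signal that a control-class operator would issue. It shows two adder operators being replaced by one adder that is used in two consecutive time steps.

```c
#include <assert.h>

/* Before compression: two adder operators work side by side in one time step. */
void add_parallel(const int a[2], const int b[2], int out[2]) {
    out[0] = a[0] + b[0];   /* adder operator 0 */
    out[1] = a[1] + b[1];   /* adder operator 1 */
}

/* After compression: a single adder operator is reused over two time steps.
 * `sel` models the configuration signal produced by a control operator
 * (an assumption made for this sketch). Returns the time steps consumed. */
int add_shared(const int a[2], const int b[2], int out[2]) {
    assert(a && b && out);
    int steps = 0;
    for (int sel = 0; sel < 2; sel++) {
        int x = sel ? a[1] : a[0];   /* operand selection for this time step */
        int y = sel ? b[1] : b[0];
        out[sel] = x + y;            /* the one shared adder */
        steps++;
    }
    return steps;
}
```

Both versions compute the same sums; the shared version saves one adder at the cost of a second time step, which is acceptable as long as the total stays within the timing constraint.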
Based on the above lower-level hardware mapping method for integrated circuits, the invention also discloses a lower-level hardware mapping device for integrated circuits. Referring to Fig. 7b, the device of this embodiment comprises:
Program analysis module 1, used to read the computer language program describing the integrated circuit algorithm and to identify, according to the rules of that computer language, the execution objects and parameter objects to be mapped; data control flow graph generation module 2, used to map the identified execution objects and parameter objects to a data control flow graph describing the integrated circuit algorithm; operator spacetime diagram generation module 3, used to expand each node of the data control flow graph according to data flow dependency and, according to the processing function of each expanded node, to take at least one operator unit of the corresponding function from a pre-built operator unit library, thereby converting the data control flow graph into an operator spacetime diagram composed of operator units; timing constraint module 4, used to determine the total timing constraint from the user's specification requirements and the requirements of the target integrated circuit process, to label each operator unit in the operator spacetime diagram with a time, and to apply timing constraints to each level of the operator spacetime diagram; spacetime diagram compression module 5, used to cluster-compress the spacetime diagram in space according to the time labels so that the overall algorithm execution time approaches the total timing constraint; and lower-level hardware mapping module 6, used to generate the lower-level hardware circuit of the integrated circuit from the cluster-compressed spacetime diagram.
The lower-level hardware mapping method and device of the invention are described below with reference to a specific embodiment.
H.264 is a digital video coding standard jointly developed by the Joint Video Team (JVT) formed by the International Telecommunication Union (ITU-T) and the International Organization for Standardization (ISO). This embodiment takes the function x264_me_search from a C-language description of the H.264 standard as an example to describe the method of the invention in more detail.
As shown in Figure 8, analyzing the function x264_me_search comprises the steps:
S11: read the computer language program and find the functions in it. In this embodiment, the program is first read and then subjected to syntactic and lexical analysis, which yields the function x264_me_search.
S12: parse the function to obtain its call relationships and parameter objects. The parameter objects comprise the input data, output data and input constants of the function, as well as the intermediate data of its upper-level function, and each parameter object is marked with information such as its data dependency and whether it uses shared or distributed storage. In this embodiment, analyzing x264_me_search yields its input variables, output variables, input constants and output constants, as shown in Table 1:
Table 1:
Signal name    Data type    Direction    Description
i_pixel        int          IN           // PIXEL_WxH
lm             int          IN           // lambda motion
p_fref         uint8_t*     IN           // reference frame
p_fenc         uint8_t*     IN           // frame being encoded
i_stride       int          IN           // image width
i_mv_range     int          IN           // maximum range of the motion vector
mvp[2]         int          IN           // motion vector predictor
cost           int          OUT          // satd+lm*nbits
mv[2]          int          OUT          // motion vector
In this embodiment, analyzing the interior of the function x264_me_search yields its internal variables and constants, as shown in Table 2:
Table 2:
Signal name    Data type    Description
i_pixel        int          // pixel size WxH
bcost          int          // temporary best estimate point
bmx            int          // temporary motion vector, x component
bmy            int          // temporary motion vector, y component
p_fref         uint8_t*     // reference frame address
i_iter         int          // loop variable
S13: determine from the call relationships of the function being parsed whether it contains a lower-level function. If so, find the lower-level function and execute step S14; otherwise end the operation. In this embodiment, if the function being parsed contains no lower-level function, it is itself a bottom-level function. Parsing x264_me_search shows that it calls the function pixel_sad_16x16, so x264_me_search is the upper-level function and pixel_sad_16x16 is the lower-level function.
S14: analyze the lower-level function to obtain its parameter objects, including input data and output data, and mark each parameter object with information such as its data dependency and whether it uses shared or distributed storage. In this embodiment, analyzing the lower-level function pixel_sad_16x16 yields its inputs and outputs, as shown in Table 3:
Table 3:
Figure BDA0000042572150000091
In this embodiment, data dependency refers to the relationships found by analysis between variables and/or constants, and comprises computation dependency and storage dependency. Computation dependency is input/output dependency: the output variable or output signal of a computing operator depends on its input variable or input signal. Storage dependency comprises write-read, read-write and write-write dependency. Write-read dependency means that, for a variable at a given memory address, the correct order is to write first and then read, so the read depends on the write; if the read is performed before the write, a data error occurs. Read-write dependency means that the correct order is to read first and then write, so the write depends on the read; if the write is performed before the read, the data to be read is overwritten and an error occurs. Write-write dependency means that writes to the same memory address must be performed in the correct order, so the writes depend on one another.
In this embodiment, shared storage means that the storage elements within one module are connected and can access one another, so data can be exchanged between them through the shared storage. Distributed storage means that each module has its own independently allocated storage, a module cannot access the storage of other modules, and data can only be exchanged between modules through communication ports. In hardware design, shared storage requires additional interconnect resources compared with distributed storage, so shared and distributed storage must be distinguished from the program content when the algorithm data control flow graph is generated.
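The three kinds of storage dependency can be sketched as a classification of two accesses given in program order. The type and function names below (`Access`, `StorageDep`, `classify_storage_dep`) are hypothetical and introduced only for illustration; they are not part of the described method.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ACC_READ, ACC_WRITE } AccessKind;
typedef enum { DEP_NONE, DEP_WRITE_READ, DEP_READ_WRITE, DEP_WRITE_WRITE } StorageDep;

typedef struct {
    AccessKind kind;   /* read or write */
    size_t     addr;   /* memory address touched by the access */
} Access;

/* Classify the storage dependency between two accesses, where `first`
 * precedes `second` in program order. Accesses to different addresses,
 * and two reads of the same address, carry no storage dependency. */
StorageDep classify_storage_dep(Access first, Access second) {
    assert(first.kind == ACC_READ || first.kind == ACC_WRITE);
    if (first.addr != second.addr)
        return DEP_NONE;
    if (first.kind == ACC_WRITE && second.kind == ACC_READ)
        return DEP_WRITE_READ;    /* the read must see the written value */
    if (first.kind == ACC_READ && second.kind == ACC_WRITE)
        return DEP_READ_WRITE;    /* the write must not overtake the read */
    if (first.kind == ACC_WRITE && second.kind == ACC_WRITE)
        return DEP_WRITE_WRITE;   /* the writes must keep their order */
    return DEP_NONE;              /* read followed by read */
}
```

Any pair that is not classified as `DEP_NONE` has to keep its program order when the data flow is later expanded or compressed.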
As shown in Figure 9, generating the data control flow graph from the execution objects and parameter objects obtained by analysis comprises the steps:
S21: map the identified parameter objects to storage nodes on the data flow, and distinguish shared storage from distributed storage according to the marked storage information.
S22: according to the marked data dependencies, map the operation instructions in the function to processing blocks in the data control flow graph. The operation instructions in this embodiment include addition, subtraction, shifting and similar operations.
S23: according to the marked data dependencies, map the control instructions in the function to the control flow of the data control flow graph, which identifies the states, state transition conditions and state control signals.
In this embodiment, the control instructions include call relationships between functions, loop instructions and so on. Loop statements comprise static loops and dynamic loops, and dynamic loops in turn comprise dynamic loops that can be converted into static loops, single-level dynamic loops and multi-level dynamic loops. A static loop is one whose loop count is constant. A dynamic loop that can be converted into a static loop has a variable loop count, but once the application scenario of the loop is fixed its loop count is also fixed and becomes a constant, so the dynamic loop becomes a static loop. A single-level dynamic loop has a variable loop count and contains no nested loops. A multi-level dynamic loop has a variable loop count and contains nested inner loops.
In this embodiment, when the loop statement is a static loop, mapping it to the control flow comprises the steps:
S2411: unroll the loop body according to the loop count, obtaining as many new loop bodies as there are iterations. Each new loop body contains an operation expression, and the operation expressions share common parameter objects.
In this embodiment, the function static void predict_16x16_dc contains three loop statements:
static void predict_16x16_dc(uint8_t*src,int i_stride)
{
int dc=0;int i,j;
for(i=0;i<16;i++)
{dc+=src[-1+i*i_stride];dc+=src[i-i_stride];}
dc=(dc+16)>>5;
for(i=0;i<16;i++)
{
for(j=0;j<16;j++)
{src[j]=dc;}
src+=i_stride;
}
}
Unrolling the first loop therefore yields the following new operation expressions:
dc=dc+src[-1];dc=dc+src[-i_stride];dc=dc+src[-1+i_stride];dc=dc+src[1-i_stride];......dc=dc+src[-1+15*i_stride];dc=dc+src[15-i_stride];dc=(dc+16)>>5;
Unrolling the second and third loops yields the operation expressions:
src[0]=dc;
src[15]=dc;
src[0+i_stride]=dc;
src[15+i_stride]=dc;
src[0+15*i_stride]=dc;
src[15+15*i_stride]=dc;
S2412: according to the shared parameter object, iterate over the new expressions obtained by unrolling, substituting each into the next, to obtain a single new operation expression. In this embodiment, iterating the operation expressions obtained from the first loop above over the variable dc yields one new expression in dc:
dc=(0+src[-1]+src[-i_stride]+src[-1+i_stride]+src[1-i_stride]+...+src[-1+15*i_stride]+src[15-i_stride]+16)>>5;
S2413: map the operation instructions in this new operation expression to processing blocks, and map the parameter objects in the expression to storage nodes on the data flow. In this embodiment, when the loop statement is a dynamic loop that can be converted into a static loop, its loop count becomes a constant once the application environment is determined, and its mapping steps are the same as those of a static loop.
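The unrolling and iteration of steps S2411 and S2412 can be checked with a small C sketch. The helper names `dc_loop`, `dc_folded`, `LEFT` and `TOP` are hypothetical and used only for illustration. `dc_loop` keeps the first loop of predict_16x16_dc as written, and `dc_folded` is the single expression obtained after unrolling all 16 iterations and substituting them into one another.

```c
#include <assert.h>
#include <stdint.h>

/* Loop form: the first loop of predict_16x16_dc followed by the rounding shift. */
int dc_loop(const uint8_t *src, int i_stride) {
    int dc = 0;
    for (int i = 0; i < 16; i++) {
        dc += src[-1 + i * i_stride];   /* left neighbour column */
        dc += src[i - i_stride];        /* top neighbour row */
    }
    return (dc + 16) >> 5;
}

#define LEFT(i) src[-1 + (i) * i_stride]
#define TOP(i)  src[(i) - i_stride]

/* Folded form: one expression, which maps to a tree of adders and one shifter. */
int dc_folded(const uint8_t *src, int i_stride) {
    assert(src != 0);
    return (0
        + LEFT(0)  + TOP(0)  + LEFT(1)  + TOP(1)  + LEFT(2)  + TOP(2)  + LEFT(3)  + TOP(3)
        + LEFT(4)  + TOP(4)  + LEFT(5)  + TOP(5)  + LEFT(6)  + TOP(6)  + LEFT(7)  + TOP(7)
        + LEFT(8)  + TOP(8)  + LEFT(9)  + TOP(9)  + LEFT(10) + TOP(10) + LEFT(11) + TOP(11)
        + LEFT(12) + TOP(12) + LEFT(13) + TOP(13) + LEFT(14) + TOP(14) + LEFT(15) + TOP(15)
        + 16) >> 5;
}

#undef LEFT
#undef TOP
```

Because `src` is indexed with negative offsets, the caller must pass a pointer that has at least one row above and one column to the left of it inside a valid buffer.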
In this embodiment, when the loop statement is a single-level dynamic loop, generating the control flow comprises:
S2421a: map the loop contents to a processing block and the loop instruction to a state machine. With the loop contents mapped to a processing block and the loop instruction mapped to a state machine, the data flow is divided into two states: the sequential state (sequence) A and the loop state (loop) A, as shown in Figure 10.
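A software model of this two-state machine might look as follows. It is a sketch under assumptions: the loop body is taken to be a simple accumulation, and the names `LoopState`, `ST_SEQUENCE`, `ST_LOOP` and `run_dynamic_loop` are hypothetical. The loop count `n` is only known at run time, which is what makes the loop dynamic.

```c
#include <assert.h>

typedef enum { ST_SEQUENCE, ST_LOOP } LoopState;

/* Sum data[0..n-1] with an explicit two-state machine. The number of
 * ticks spent in the loop state is reported through ticks_in_loop. */
int run_dynamic_loop(const int *data, int n, int *ticks_in_loop) {
    assert(n >= 0 && ticks_in_loop != 0);
    LoopState state = ST_SEQUENCE;
    int i = 0, acc = 0, done = 0;
    *ticks_in_loop = 0;
    while (!done) {
        switch (state) {
        case ST_SEQUENCE:             /* code outside the loop */
            if (i < n) state = ST_LOOP;   /* transition condition */
            else done = 1;
            break;
        case ST_LOOP:                 /* loop body = processing block */
            acc += data[i++];
            (*ticks_in_loop)++;
            if (i >= n) state = ST_SEQUENCE;
            break;
        }
    }
    return acc;
}
```

The `switch` plays the role of the state machine, and the statements inside `ST_LOOP` correspond to the processing block that the loop contents are mapped to.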
In this embodiment, when the loop is a multi-level dynamic loop, generating the control flow comprises:
S2421b: map the contents of the inner loop statement and the outer loop statement to processing block A and processing block B respectively, map the outer loop to state machine B and the inner loop to state machine A, and nest processing block A and its state machine inside processing block B, as shown in Figure 11. In other words, each loop of the multi-level dynamic loop is mapped to a state machine, and the control flow of the corresponding data control flow is generated by nesting the state machines.
Of course, in this embodiment the multi-level dynamic loop can also be handled by mapping the inner and outer loops to one unified state machine, as shown in Figure 12. If the nested loop has N levels, the loop contents are mapped to processing blocks, the loops are mapped to one unified state machine, and the state machine has N+1 states. In other words, when a multi-level dynamic loop uses a unified state machine, the number of states equals the number of loop levels plus one.
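The unified state machine can be sketched for N = 2 nested loops, which gives N + 1 = 3 states. The example is hypothetical: it sums a rows x cols matrix whose dimensions are only known at run time, and the names `NestState` and `run_nested` are introduced for illustration.

```c
#include <assert.h>

/* Two nested loops (N = 2) handled by one machine with N + 1 = 3 states. */
typedef enum { S_SEQ, S_OUTER, S_INNER } NestState;

long run_nested(int rows, int cols, const int *m) {
    assert(rows >= 0 && cols >= 0);
    NestState st = S_SEQ;
    int i = 0, j = 0, done = 0;
    long acc = 0;
    while (!done) {
        switch (st) {
        case S_SEQ:                   /* code outside both loops */
            if (i < rows) st = S_OUTER;
            else done = 1;
            break;
        case S_OUTER:                 /* outer loop body before the inner loop */
            j = 0;
            if (j < cols) {
                st = S_INNER;
            } else {                  /* empty inner loop: advance the outer one */
                i++;
                if (i >= rows) st = S_SEQ;
            }
            break;
        case S_INNER:                 /* inner loop body */
            acc += m[i * cols + j];
            j++;
            if (j >= cols) {
                i++;
                st = (i < rows) ? S_OUTER : S_SEQ;
            }
            break;
        }
    }
    return acc;
}
```

Each extra loop level would add one more state and one more counter, which matches the N + 1 rule stated above.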
In this embodiment, the control statements in the computer language program include branch control statements, which comprise single-branch control statements and nested branch control statements.
In this embodiment, when the branch control statement is a single-branch control statement:
S2421b: map the branch control instruction to a multiplexer.
S2422b: map the content of the control statement to the processing blocks at the inputs of the multiplexer.
S2423b: map the control condition to the processing block at the control input of the multiplexer, finally obtaining the structure of the branch control statement, as shown in Figure 13. In this embodiment, if the control statement content and/or the control condition is a variable, it is mapped to a storage node.
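The branch-to-multiplexer mapping can be mimicked in C. The sketch assumes a hypothetical source statement `if (a > b) y = a - b; else y = b - a;`, and the names `mux2` and `abs_diff_mux` are illustrative. Both branch bodies are evaluated as input-side processing blocks, the condition is evaluated as the control-side processing block, and the multiplexer selects one result.

```c
#include <assert.h>

/* Two-input multiplexer: returns in1 when sel is non-zero, else in0. */
int mux2(int sel, int in0, int in1) {
    return sel ? in1 : in0;
}

/* Hardware-style evaluation of: if (a > b) y = a - b; else y = b - a; */
int abs_diff_mux(int a, int b) {
    int in_true  = a - b;   /* input-side processing block, "then" branch */
    int in_false = b - a;   /* input-side processing block, "else" branch */
    int sel      = a > b;   /* control-side processing block */
    return mux2(sel, in_false, in_true);
}
```

Unlike a software branch, both subtractions exist as hardware at the same time; only the selection depends on the condition.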
In this embodiment, when the branch control statement is a nested branch control statement, the upper-level and lower-level control statements of the nested branch are already known from the earlier analysis of the function:
S2421c: map the upper-level branch control instruction to multiplexer 1, the content of the upper-level branch control statement to the processing blocks at the inputs of multiplexer 1, and the control condition of the upper-level branch control statement to the processing block at the control input of multiplexer 1.
S2422c: map the lower-level branch control instruction to multiplexer 2, the content of the lower-level branch control statement to the processing blocks at the inputs of multiplexer 2, and the control condition of the lower-level branch control statement to the processing block at the control input of multiplexer 2.
S2423c: use the output of multiplexer 1 as an input of multiplexer 2, obtaining the structure of the nested branch control statement, as shown in Figure 14.
Of course, in this embodiment, if the control statement content and/or the control condition is a variable, it is mapped to the corresponding storage node on the data flow.
Based on the above method, the data control flow graphs corresponding to the functions x264_me_search and pixel_sad_16x16 are obtained, as shown in Figure 15 and Figure 16 respectively.
The control expression conversion method of this embodiment is the core of converting all language expressions, and its conversion efficiency directly affects the data volume of the generated algorithm data control flow graph. The conversion method proposed in this embodiment targets hardware implementation, so parallelism and hardware structures such as state machines and multiplexers are taken into account in advance, which helps hardware engineers with hardware design to a greater extent.
Based on the data control flow graph generated by the above method, the method for generating the operator spacetime diagram is described in more detail below with reference to a specific embodiment.
The principle for converting the data control flow graph into an operator spacetime diagram is as follows. Computation and storage can be realized in three structural forms: (1) input and output share the same memory and the computing part is reconfigurable, which is the processor-based data flow form; (2) the pipelined data flow form; (3) the parallel data flow form. The first form is the general one. Its data flow is serial in time, and when converting from a high-level language to hardware it must be parallelized according to the timing requirements.
Therefore, the principles for expanding the data flow in this embodiment are:
I. Sequentially dependent data flow structure: the flow is expanded with operator units, and the expansion can be pipelined. The pipeline beat is calculated from the longest processing time, and memory is used for data storage between pipeline stages to adjust the beat. For example, the execution times of Func A, Func B and Func C are 3, 4 and 5 unit times respectively; Func A has an input bandwidth of 3 data/cycle and an output bandwidth of 2 data/cycle, Func B has an input bandwidth of 5 data/cycle and an output bandwidth of 4 data/cycle, and Func C has an input bandwidth of 7 data/cycle and an output bandwidth of 6 data/cycle. Originally the data A_OUT, B_IN, B_OUT, C_IN, C_OUT and A_IN share the same storage and the computing part is reconfigured. When the data flow relationship is A_OUT = B_IN, B_OUT = C_IN and C_OUT ≠ A_IN, that is, the output of Func A is the input of Func B, the output of Func B is the input of Func C, and there is no data feedback, the data flow can be converted into a pipeline structure. The pipeline beat is calculated from the longest processing time, and storage operators are inserted as breaks to adjust the beat, as shown in Figure 17.
II. Data flow with feedback: when the data flow forms a loop, a data flow with data dependency cannot be converted into a pipeline structure. However, if the batches of data stored in the same shared storage depend on one another but there is no data dependency inside each batch, each batch can be processed with an internal pipeline, which reduces the storage bandwidth. For example, the data A_IN consists of A_IN_0, A_IN_1 and A_IN_2. Although A_IN as a whole depends on the output C_OUT, there is no data dependency among A_IN_0, A_IN_1 and A_IN_2, so they can be processed as batches in an internal pipeline, and Func A is executed again once C_OUT has been obtained. This yields the local pipeline structure shown in Figure 18.
III. Parallel data flow structure: because there is no input/output data dependency, the hardware can be expanded independently in parallel. For example, when A_IN, B_IN and C_IN are each different from A_OUT, B_OUT and C_OUT, that is, no input of Func A, Func B or Func C is related to any of their outputs, the data flow is expanded into parallel form, as shown in Figure 19.
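The three cases above can be sketched as a small classifier over a chain of functions. This is an illustrative model, not the patent's procedure: each function is described by the identifier of the buffer it reads and the buffer it writes, and the names `FlowKind`, `classify_flow` and `pipeline_beat` are hypothetical. `pipeline_beat` returns the longest stage time, which principle I uses as the pipeline beat.

```c
#include <assert.h>

typedef enum { FLOW_PARALLEL, FLOW_PIPELINE, FLOW_FEEDBACK, FLOW_OTHER } FlowKind;

/* in[i] and out[i] are the buffer ids read and written by function i. */
FlowKind classify_flow(const int *in, const int *out, int n) {
    assert(n > 0);
    int dependent = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (in[i] == out[j]) dependent = 1;
    if (!dependent)
        return FLOW_PARALLEL;               /* case III: no input equals any output */
    for (int i = 0; i + 1 < n; i++)
        if (out[i] != in[i + 1])
            return FLOW_OTHER;              /* not a simple chain */
    return (out[n - 1] == in[0]) ? FLOW_FEEDBACK   /* case II: last output feeds back */
                                 : FLOW_PIPELINE;  /* case I: pure chain */
}

/* Pipeline beat = longest processing time among the stages. */
int pipeline_beat(const int *stage_time, int n) {
    assert(n > 0);
    int beat = stage_time[0];
    for (int i = 1; i < n; i++)
        if (stage_time[i] > beat) beat = stage_time[i];
    return beat;
}
```

For the Func A, Func B, Func C example with execution times 3, 4 and 5, the beat is 5 unit times, so the storage inserted between stages has to absorb the difference for the two faster stages.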
Based on the above principles, converting the data control flow graph into the corresponding operator spacetime diagram comprises the steps:
S31: build the operator unit library in advance.
S32: store the input/output data and intermediate data according to the marked shared storage and distributed storage information.
S33: when a sequentially dependent data flow exists in the data control flow graph, expand that data flow into a pipeline structure.
S34: take from the operator unit library the operator units whose functions correspond to each expanded node. When the processing performed by a node of the data control flow graph is simple, only one or two operators of the library are needed; when the processing is more complex, it corresponds to several operators of the library, and the node in the data control flow graph is replaced by the combination of these operators. In this embodiment, the operation logic L0 and L1 of the function x264_me_search are: L0 logic: bmx=x264_clip3((m->mvp[0]+2)>>2, -m->i_mv_range, m->i_mv_range); L1 logic: bmy=x264_clip3((m->mvp[1]+2)>>2, -m->i_mv_range, m->i_mv_range).
In this embodiment, logic L0 and logic L1 are sequentially dependent data flow structures, so they are expanded into the operator spacetime diagram shown in Figure 20, where X0 and X1 are the configuration signals produced by the control operators and the control flow is drawn with dashed lines. The rectangular processing blocks in the figure represent the mapped computing operators or storage operators, and their interconnection is provided by link operators.
In this embodiment, the data flows may contain not only sequential dependencies but also feedback. In that case step S35 is executed: according to the storage information, determine whether data dependency exists among the data inside each batch stored in the same shared storage; if there is none, each batch can be processed with an internal pipeline, which reduces the storage bandwidth.
In this embodiment, the operation logic L3 of the function x264_me_search comprises two parts. First part:
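To make the operator sequence in logic L0 and L1 concrete, the following sketch spells out the steps as an add, a shift and a clamp. The helper `clip3_sketch` is an assumed stand-in for x264_clip3 (clamping a value to a closed range), and `initial_search_coord` is a hypothetical name; neither is taken from the patent text.

```c
#include <assert.h>

/* Assumed behaviour of the clip helper: clamp v to the range [lo, hi].
 * In hardware this corresponds to two comparators and two multiplexers. */
int clip3_sketch(int v, int lo, int hi) {
    assert(lo <= hi);
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

/* Logic L0 / L1: add 2, shift right by 2, then clamp to +/- mv_range.
 * Each step depends on the previous one, so the operators form a chain. */
int initial_search_coord(int mvp, int mv_range) {
    int rounded = mvp + 2;          /* adder operator */
    int scaled  = rounded >> 2;     /* shifter operator */
    return clip3_sketch(scaled, -mv_range, mv_range);
}
```

Because every step consumes the result of the step before it, this is a sequentially dependent data flow, and L0 and L1 are independent of each other since they read different elements of mvp.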
bcost=h->pixf.sad[i_pixel](m->p_fenc,m->i_stride,p_fref,(m->i_stride)*5);
for(i_iter=0;i_iter<16;i_iter++)
{int best=0;int cost[4];
(cost[0])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(-1)*m->i_stride*5+(0)],m->i_stride*5)+m->lm*
(bs_size_se(((bmx+(0))<<2)-m->mvp[0])+bs_size_se(((bmy+(-1))<<2)-
m->mvp[1]));
(cost[1])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(1)*m->i_stride*5+(0)],m->i_stride*5)+m->lm*
(bs_size_se(((bmx+(0))<<2)-m->mvp[0])+bs_size_se(((bmy+(1))<<2)-
m->mvp[1]));
(cost[2])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(0)*m->i_stride*5+(-1)],m->i_stride*5)+m->lm*
(bs_size_se(((bmx+(-1))<<2)-m->mvp[0])+bs_size_se(((bmy+(0))<<2)-
m->mvp[1]));
(cost[3])=h->pixf.sad[i_pixel](m->p_fenc,
m->i_stride,&p_fref[(0)*m->i_stride*5+(1)],m->i_stride*5)+m->lm*
(bs_size_se(((bmx+(1))<<2)-m->mvp[0])+bs_size_se(((bmy+(0))<<2)-m->mvp[1]));
Second part:
if(cost[1]<cost[0]) best=1;
if(cost[2]<cost[best])best=2;
if(cost[3]<cost[best])best=3;
if(bcost<=cost[best])
break;
bcost=cost[best];
Here the first part is a sequentially dependent data flow. In the second part, each batch of data contains a sequentially dependent data flow and there is also feedback, so the internal pipeline is generated first and the feedback is converted afterwards. The two parts are converted into spacetime diagrams composed of operator units, as shown in Figure 21a and Figure 21b respectively.
In this embodiment, the batches of data flow stored in the same shared storage may also have no data dependency, that is, they may be independent of one another; such data flows are expanded in parallel.
S36: expand the data flows in parallel. In this embodiment, the operation logic L5 of the function x264_me_search is: m->mv[0]=bmx<<2; m->mv[1]=bmy<<2; Logic L5 is a parallel data flow, and the operator spacetime diagram generated from it is shown in Figure 22.
Following the above principles, the operator spacetime diagrams converted from the data control flow graphs of the functions x264_me_search and pixel_sad_16x16 are shown in Figure 23a and Figure 23b respectively. Because x264_me_search calls pixel_sad_16x16 in the original program, Figure 23a does not show the detailed operator spacetime diagram of pixel_sad_16x16; that diagram is shown in Figure 23b.
Based on the above operator spacetime diagram generation method, the invention also provides a spacetime diagram generating device. Figure 24 is a structural block diagram of one embodiment of the spacetime diagram generating device of the invention.
The spacetime diagram generating device of this embodiment comprises: operator unit library 1, used to store operator units that can realize computing functions; and operator spacetime diagram generation module 2, used to expand the data control flow graph according to data dependency, to take at least one operator unit of the corresponding function from the operator unit library according to the processing performed by each expanded node, and to convert each functional module in the data control flow graph into an operator spacetime diagram composed of operator units. The operator spacetime diagram generation module 2 first converts the data control flow graph of the upper-level function into an operator spacetime diagram, and then converts the data control flow graph of the lower-level function.
Figure 25 is a structural block diagram of one embodiment of the spacetime diagram generation module 2 of this embodiment. The module comprises: pipeline structure processing submodule 21, used to determine whether a data flow in the data control flow graph has a sequentially dependent structure and, if so, to expand it as a pipeline and convert it into an operator spacetime diagram composed of operator units; feedback data flow processing submodule 22, used to determine whether a data flow in the data control flow graph has feedback; if there is feedback, it determines whether data dependency exists between the data flows involved in the feedback; if such dependency exists, it determines whether data dependency exists among the data inside each of these data flows; and if there is no dependency inside each data flow, it expands those data flows as local pipelines and converts them into an operator spacetime diagram composed of operator units; and parallel structure processing submodule 23, used to determine whether data dependency exists between the data flows of the data control flow graph and, if there is none, to expand the data flows in parallel and convert them into an operator spacetime diagram composed of operator units.
Based on the data control flow graph generated by the above method, the method for applying timing constraints to the data control flow graph is described in detail below with reference to a specific embodiment.
Applying timing constraints to the data flow graph is divided into two stages:
I. Determine the requirement specification of the algorithm and the target integrated circuit process, and adopt different operator basic timing units for different processes. II. Taking the function as the unit, mark the data flow graph with timing constraints; the timing constraints can be derived from the top level down to the lower-level functions according to their data dependencies and variable storage information.
Based on the above principles, the method for applying timing constraints to the data control flow graph in this embodiment comprises the steps:
S41: determine the total timing constraint from the user's specification requirements and the requirements of the target integrated circuit process. In this embodiment, the x264 code is a C-language description of the H.264 algorithm, and the requirement specification of the algorithm is taken to be 720p@60fps (a video resolution of 1280 × 720 pixels, with 60 frames of video data processed per second). If the target integrated circuit process is 130 nm, the operator technology library gives a standard cell timing of 5 ns, so circuits generated with this library can run at a clock frequency of 200 MHz. If the target process is 65 nm, the operator technology library gives a standard cell timing of 2.5 ns, so the generated circuits can run at 400 MHz.
In this embodiment, after the total timing constraint has been obtained in step S41, there are two ways of applying timing constraints to the different data flow structures in the data control flow graph: constraining according to parallel data flow, and constraining according to serial data flow.
When the data flow in the data control flow graph is a parallel data flow, step S42 comprises the step:
S421a: divide the total timing constraint equally among the operator levels of the corresponding spacetime diagram, and divide the timing constraint of each operator level equally among the operator units in that level. Taking the H.264 algorithm as an example, at frame level the constraint comes from the requirement specification of 720p@60fps, that is, 60 frames processed per second. If a 130 nm process is chosen, circuits generated with the operator technology library run at 200 MHz, so the timing constraint is 200 MHz / 60 frames ≈ 3.33M cycles/frame. At macroblock level, the macroblocks are processed serially, so the timing constraint is allocated serially: each frame consists of 80 × 45 = 3600 macroblocks, the required throughput is 1280 × 720 × 60 / 256 = 216000 macroblocks (MB) per second, and the number of operator basic timing units available per macroblock is 200 MHz / 216000 MB/s ≈ 926 cycles/MB. Equivalently, the frame-level timing is divided equally among the macroblocks: 3.33M cycles/frame / 3600 ≈ 926 cycles/MB. Macroblock processing in a video encoder is divided into inter prediction, intra prediction, transform and quantization, entropy coding, deblocking filtering and so on. These operations run in parallel, so the timing constraint of each of them is also 926 cycles/MB. In this way the timing constraints of the lower-level processing modules are generated level by level.
In this embodiment, when timing constraints are applied to serial data, step S42 comprises the steps:
S421b: take the total number of operator basic timing units corresponding to the operator levels of the serial nodes in the data control flow as the overall timing constraint, and allocate the timing of each operator level according to the ratio of the timing of the computing operators mapped by the longest operation path in that level to the sum of the timings of the operator units mapped by the longest operation paths in all levels. For example, suppose two operator levels A and B are serial in the data control flow graph and their overall timing constraint is N operator basic timing units in total. The timings of the computing operators mapped by the longest operation paths in A and B are taken as the reference, and the N operator basic timing units are allocated in proportion to this reference. Suppose the longest operation path in operator level A maps to Ma computing operators, corresponding to Na operator basic timing units, and the longest operation path in operator level B maps to Mb computing operators, corresponding to Nb operator basic timing units. The timing constraint assigned to operator level A (module A) is then:
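The budget arithmetic of this step can be reproduced with a few integer helpers. The function names are hypothetical, and a macroblock is assumed to be 16 × 16 pixels, which is what the division by 256 in the text implies. Integer division truncates, so the sketch yields 925 cycles per macroblock where the text rounds 925.9 up to 926.

```c
#include <assert.h>

/* Clock cycles available for one frame at the given clock and frame rate. */
long long cycles_per_frame(long long clock_hz, int fps) {
    assert(fps > 0);
    return clock_hz / fps;
}

/* Number of 16x16 macroblocks in a frame (assumes dimensions divisible by 16). */
int macroblocks_per_frame(int width, int height) {
    return (width / 16) * (height / 16);
}

/* Clock cycles available for one macroblock, truncated to an integer. */
long long cycles_per_macroblock(long long clock_hz, int fps, int width, int height) {
    long long mb_per_second = (long long)macroblocks_per_frame(width, height) * fps;
    assert(mb_per_second > 0);
    return clock_hz / mb_per_second;
}
```

With the 65 nm figure of 400 MHz from step S41, the same calculation gives roughly twice the budget per macroblock, which leaves more room for operator reuse during compression.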
Figure BDA0000042572150000161
The timing constraint allocated to operator level B is:
Figure BDA0000042572150000162
In this embodiment, a computation operator may take several basic timing units to execute, so the number of operators does not necessarily equal the number of operator basic timing units.
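The two allocation formulas are given only as figures and are not reproduced here. Read from the surrounding description, the allocation appears to be a proportional split of the N basic timing units by the longest-path timing of each level, that is, N × Na / (Na + Nb) for level A and N × Nb / (Na + Nb) for level B. The sketch below implements that assumed reading; the function name and the sample values of Na and Nb are hypothetical.

```python
def allocate_serial_timing(total_units, longest_path_units):
    """Split total_units across serial operator levels in proportion to the
    basic timing units on each level's longest computation path.

    This is an assumed reading of the allocation rule, not the patent's formula.
    """
    path_sum = sum(longest_path_units.values())
    if path_sum <= 0:
        raise ValueError("longest-path timing must be positive")
    return {
        level: total_units * units / path_sum
        for level, units in longest_path_units.items()
    }


# Hypothetical values: N = 926 units shared by levels A (Na = 30) and B (Nb = 70).
shares = allocate_serial_timing(926, {"A": 30, "B": 70})
print(shares)
```

Under this reading the shares always sum to N, so the serial levels together never exceed the overall timing constraint.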
S422b: the serial data stream in the data control flow graph carries parallel-data markers, and the timing constraint obtained in step S421b is corrected according to these markers. For example, module A may obtain the information p-x-n-m, where p is the parallelism flag; x is the parallelism type, such as pipeline parallelism, local pipeline parallelism or data parallelism; n is the parallel position, that is, the n-th processing stage at which the parallelism occurs; and m is the number of parallel paths. Based on this information, the timing constraint of module A can be adjusted to:
Figure BDA0000042572150000163
where β_n is an empirical parameter for the parallelization at stage n, and N_pn is the number of basic timing units of the processing operators at stage n.
Based on the above method, the structures obtained after timing annotation of the structures shown in Figures 23a and 23b are shown in Figures 26a and 26b.
Because the structure shown in Figure 23a calls the operator structure of the function pixel_sad_16×16, the structure shown in Figure 26b is compressed first, and the result is shown in Figure 27. The structure shown in Figure 26a is then compressed in the same way as the space-time diagram of Figure 27, which yields the compressed space-time diagram of the function X264_me_search.
When the space-time diagrams of the structures shown in Figures 23a and 23b are compressed, the following principles are mainly followed:
1. Computation-class operators with the same operational attribute in the operator space-time diagram are cluster-compressed. For example, two parallel addition operators can be compressed into a single addition operator in the space-time diagram, and a control operator is introduced so that the compressed addition operator is multiplexed and performs the same function as the two addition operators before compression. After the operator space-time diagram is compressed, the number of operators can therefore be reduced substantially, which saves integrated circuit area; correspondingly, because the compressed operators are multiplexed through control operators, the execution time of the overall algorithm increases. Cluster compression of the computation-class operators necessarily causes corresponding changes in the storage-class, control-class, path-class and clock-class operators, so these operators, especially the storage-class operators, can also be cluster-compressed as the actual situation allows in order to save further area.
2. When a control operator is introduced, corresponding configuration instructions are generated. The configuration instructions make the generated operators work in a predetermined manner, so that they realize the same function as the operators before compression.
3. The same operator space-time diagram has multiple possible cluster compression results. During compression, therefore, the space-time diagram whose overall algorithm execution time after compression is closest to the constraint time is selected as the final compression result; this saves the most integrated circuit area while still guaranteeing that the timing condition is satisfied. The constraint time is the maximum execution time of the integrated circuit, calculated from the performance indices specified by the user.
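As a minimal illustration of principles 1 and 2, the sketch below models two parallel addition operators being merged into one shared addition operator that is reused over successive time steps. The function names, and the plain list used as a stand-in for the configuration instructions of a control operator, are assumptions made for illustration and are not taken from the operator library of the embodiment.

```python
def parallel_adds(pairs):
    """Before compression: one adder per operand pair, all in one time step."""
    results = [a + b for a, b in pairs]
    return results, len(pairs), 1   # results, adders used, time steps


def multiplexed_adds(pairs):
    """After compression: one shared adder, sequenced by a control schedule."""
    # Stand-in for configuration instructions: which pair the adder sees per step.
    schedule = list(range(len(pairs)))
    results = [None] * len(pairs)
    for step in schedule:
        a, b = pairs[step]
        results[step] = a + b       # the single adder is reused at every step
    return results, 1, len(schedule)


pairs = [(1, 2), (3, 4)]
before = parallel_adds(pairs)
after = multiplexed_adds(pairs)
print(before, after)
```

Both versions produce the same sums; the compressed version uses one adder instead of two and takes two time steps instead of one, which is the area-for-time trade described in principle 1.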
After the above cluster compression of the space-time diagram, the area and power consumption of the integrated circuit can be reduced, and the operators generated by cluster compression exhibit a certain regularity.
After the above cluster compression of the space-time diagram, some of the operators can be optimized further; one such optimization is solidifying customization of certain operators. For example, the left side of Figure 28 shows a computation-class operator after compression. Because its logic unit is not used, removing the logic unit yields the operator structure shown on the right of Figure 28, which further reduces the area of the operator. As another example, the ADDS operator shown in Figure 2 can realize different functions by changing the value of its control bits X. If a particular integrated circuit actually uses only the add-shift function of this operator, the value of control bits X can be fixed at 000, which saves power while still meeting the functional requirements. With solidifying customization, the circuit contains only the minimal hardware that realizes the required functional operations, and this minimal hardware is then designed in a fully custom manner so that it has no other extended functions. This both guarantees correct execution of the algorithm and optimizes the area and power consumption of the integrated circuit.
Analysis of the space-time diagrams of the X264_me_search function before and after cluster compression yields the table shown in Figure 29. As the table shows, after cluster compression the number of operators used falls from 1724 to 139. Although the timing correspondingly increases from 292 cycles to 500 cycles, 500 cycles is still below the 600-cycle constraint of X264_me_search. Cluster compression can therefore reduce the number of operators substantially while the execution time of the overall algorithm still satisfies the timing constraint, which saves integrated circuit area and power consumption.
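The selection rule of principle 3 can be sketched against these figures. The first two candidates below use the operator and cycle counts quoted from Figure 29; the third candidate, the tie-breaking rule and all names are hypothetical additions for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    operators: int  # operators used, a rough proxy for area
    cycles: int     # overall algorithm execution time in cycles


def select_compression(candidates: Sequence[Candidate],
                       constraint_cycles: int) -> Optional[Candidate]:
    """Return the feasible candidate whose execution time is closest to the constraint."""
    feasible = [c for c in candidates if c.cycles <= constraint_cycles]
    if not feasible:
        return None
    # Closest to the constraint from below; prefer fewer operators on a tie.
    return max(feasible, key=lambda c: (c.cycles, -c.operators))


CONSTRAINT = 600  # constraint cycles for X264_me_search
candidates = [
    Candidate("uncompressed", 1724, 292),
    Candidate("cluster-compressed", 139, 500),
    Candidate("over-compressed (hypothetical)", 90, 650),  # violates the constraint
]
best = select_compression(candidates, CONSTRAINT)
slack = CONSTRAINT - best.cycles
operator_reduction = 1 - best.operators / candidates[0].operators
print(best.name, slack, f"{operator_reduction:.1%}")
```

With these inputs the cluster-compressed diagram is chosen: it leaves 100 cycles of slack and uses roughly 92% fewer operators than the uncompressed diagram, while the hypothetical third candidate is rejected because it exceeds the constraint.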
Based on the above method, the X264_me_search function, written in C according to the H.264 standard, is analyzed and mapped to obtain the data control flow graph of X264_me_search shown in Figure 15. This data control flow graph is converted into the operator space-time diagrams shown in Figures 25a and 25b. Timing constraints are then applied according to information such as data dependence, giving the timing constraint diagrams shown in Figures 26a and 26b. The space-time diagram is next cluster-compressed according to the time annotations, giving the cluster-compressed space-time diagram of the function pixel_sad_16×16 shown in Figure 27. Finally, the integrated circuit lower hardware circuit is generated from the compressed space-time diagram. Because the data dependence, data parallelism and corresponding control information are obtained while the computer language program is analyzed, and the program is mapped to a data control flow graph, the resulting data control flow graph contains the data dependence, data parallelism and corresponding control information, which effectively assists hardware circuit designers in circuit design. In addition, the device of this embodiment automatically maps the computer language program to a data control flow graph and then converts it to an operator space-time diagram, which greatly improves the efficiency of integrated circuit hardware architecture design.
The space-time diagram generation method provided by the invention is not only applicable to the above embodiment, but can also be used in other mapping processes that generate an integrated circuit lower hardware logic description from a computer language program.
The above is a further detailed description of the invention in connection with specific embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.

Claims (10)

1. A lower hardware mapping method of an integrated circuit, characterized by comprising:
A program analysis step, for reading a computer language program that describes an integrated circuit algorithm, and identifying, according to the rules of the computer language, the execution objects and parameter objects to be mapped from the computer language program, wherein the execution objects comprise operation instructions and/or control instructions, and the parameter objects comprise one of input data, output data and intermediate data;
A data control flow graph generation step, for mapping the identified execution objects and parameter objects to the corresponding nodes of a data control flow graph that describes the integrated circuit algorithm, wherein the operation instructions are mapped to processing blocks, the control instructions are mapped to a flow identifying control states, state transition conditions and state control signals, and the input data, output data and intermediate data are mapped to storage nodes on the data flow graph;
An operator space-time diagram generation step, for taking, according to the functional processing performed by each node of the data control flow graph, at least one operator unit of the corresponding function from a pre-established operator cell library, and converting the data control flow graph into an operator space-time diagram composed of operator units;
A timing constraint step, for determining a total timing constraint according to the user specification requirements and the requirements of the target integrated circuit process, annotating each operator unit in the operator space-time diagram with its time, and applying timing constraints to each level of the operator space-time diagram;
A space-time diagram compression step, for performing spatial cluster compression on the operator space-time diagram according to the time annotations, so that its overall algorithm execution time approaches the total timing constraint;
A lower hardware mapping step, for generating an integrated circuit lower hardware circuit logic description from the cluster-compressed operator space-time diagram.
2. the method for claim 1, is characterized in that, described operator spacetime diagram generates step and comprises:
The Data Control flow graph is launched according to its data flow dependency;
The function treatment of carrying out according to each node after launching is taken out at least one operator unit of corresponding function from the operator cell library of setting up in advance, the Data Control flow graph is converted to the operator spacetime diagram that is comprised of the operator unit.
3. The method of claim 2, characterized in that the data streams in the data control flow graph have a sequentially dependent data flow structure, and the sequentially dependent data streams are unfolded in a pipelined manner and converted into an operator space-time diagram composed of operator units.
4. The method of claim 2, characterized in that the data streams in the data control flow graph contain feedback and there is no data dependence among the data within the data streams, and each data stream whose internal data have no data dependence is unfolded in a local-pipeline manner and converted into an operator space-time diagram composed of operator units.
5. The method of claim 2, characterized in that there is no data dependence between the data streams in the data control flow graph, and the data streams are unfolded in a parallel manner and converted into an operator space-time diagram composed of operator units.
6. An integrated circuit lower hardware mapping device, characterized by comprising:
A program analysis module, for reading a computer language program that describes an integrated circuit algorithm, and identifying, according to the rules of the computer language, the execution objects and parameter objects to be mapped from the computer language program, wherein the execution objects comprise operation instructions and/or control instructions, and the parameter objects comprise one of input data, output data and intermediate data;
A data control flow graph generation module, for mapping the identified execution objects and parameter objects to the corresponding nodes of a data control flow graph that describes the integrated circuit algorithm, wherein the operation instructions are mapped to processing blocks, the control instructions are mapped to a flow identifying control states, state transition conditions and state control signals, and the input data, output data and intermediate data are mapped to storage nodes on the data flow graph;
An operator space-time diagram generation module, for taking, according to the functional processing performed by each node of the data control flow graph, at least one operator unit of the corresponding function from a pre-established operator cell library, and converting the data control flow graph into an operator space-time diagram composed of operator units;
A timing constraint module, for determining a total timing constraint according to the user specification requirements and the requirements of the target integrated circuit process, annotating each operator unit in the operator space-time diagram with its time, and applying timing constraints to each level of the operator space-time diagram;
A space-time diagram compression module, for performing spatial cluster compression on the operator space-time diagram according to the time annotations, so that its overall algorithm execution time approaches the total timing constraint;
A lower hardware mapping module, for generating an integrated circuit lower hardware circuit logic description from the cluster-compressed operator space-time diagram.
7. The device of claim 6, characterized in that the operator space-time diagram generation module comprises:
An operator cell library, for storing operator units capable of realizing computation functions;
An unfolding unit, for unfolding the data control flow graph according to its data flow dependencies;
The operator space-time diagram generation module is specifically configured to take, according to the functional processing performed by each unfolded node, at least one operator unit of the corresponding function from the operator cell library, and to convert each functional module in the data control flow graph into an operator space-time diagram composed of operator units.
8. The device of claim 7, characterized in that the operator space-time diagram generation module further comprises a judging unit for judging the data flow dependency type, and the unfolding unit performs the corresponding unfolding operation according to the data flow dependency type.
9. The device of claim 8, characterized in that: when the judging unit judges that the data streams in the data control flow graph have a sequentially dependent data flow structure, the unfolding unit unfolds the sequentially dependent data streams in a pipelined manner; when the judging unit judges that the data streams in the data control flow graph contain feedback, the unfolding unit judges whether data dependence exists among the data within each data stream that contains feedback, and if no data dependence exists among the data within each such data stream, the unfolding unit unfolds each data stream whose internal data have no data dependence in a local-pipeline manner; and when the judging unit judges that there is no data dependence between the data streams in the data control flow graph, the unfolding unit unfolds the data streams in a parallel manner.
10. The device of any one of claims 6 to 9, characterized in that the operator space-time diagram generation module first converts the data control flow graph of an upper-level function into an operator space-time diagram, and then converts the data control flow graph of the sub-functions called at the lower level into an operator space-time diagram.
CN 201010619832 2010-12-31 2010-12-31 Lower hardware mapping method of integrated circuit, and space-time diagram generation method and device Active CN102054107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010619832 CN102054107B (en) 2010-12-31 2010-12-31 Lower hardware mapping method of integrated circuit, and space-time diagram generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010619832 CN102054107B (en) 2010-12-31 2010-12-31 Lower hardware mapping method of integrated circuit, and space-time diagram generation method and device

Publications (2)

Publication Number Publication Date
CN102054107A CN102054107A (en) 2011-05-11
CN102054107B true CN102054107B (en) 2013-11-06

Family

ID=43958420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010619832 Active CN102054107B (en) 2010-12-31 2010-12-31 Lower hardware mapping method of integrated circuit, and space-time diagram generation method and device

Country Status (1)

Country Link
CN (1) CN102054107B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043886B (en) * 2010-12-31 2012-10-24 北京大学深圳研究生院 Underlying hardware mapping method for integrated circuit as well as time sequence constraint method and device for data control flow
CN102054109B (en) * 2010-12-31 2014-03-19 北京大学深圳研究生院 Lower hardware mapping method of integrated circuit, and data control flow generation method and device
CN108170957B (en) * 2017-12-28 2022-01-04 佛山中科芯蔚科技有限公司 Method and system for generating data control flow diagram and integrated circuit design method
CN111208994B (en) * 2019-12-31 2023-05-30 西安翔腾微电子科技有限公司 Execution method and device of computer graphics application program and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7530047B2 (en) * 2003-09-19 2009-05-05 Cadence Design Systems, Inc. Optimized mapping of an integrated circuit design to multiple cell libraries during a single synthesis pass
CN101901161A (en) * 2010-07-21 2010-12-01 四川大学 Energy consumption related software/hardware partition-oriented hierarchical control and data flow graph modeling method
CN102043886A (en) * 2010-12-31 2011-05-04 北京大学深圳研究生院 Underlying hardware mapping method for integrated circuit as well as time sequence constraint method and device for data control flow
CN102054109A (en) * 2010-12-31 2011-05-11 北京大学深圳研究生院 Lower hardware mapping method of integrated circuit, and data control flow generation method and device
CN102054108A (en) * 2010-12-31 2011-05-11 北京大学深圳研究生院 Lower hardware mapping method of integrated circuit, and time-space diagram compression method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4397744B2 (en) * 2004-06-25 2010-01-13 パナソニック株式会社 Method for high-level synthesis of semiconductor integrated circuit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王新安.算子设计方法缩小IC设计与制造间的"剪刀差".《集成电路应用》.2010,(第7期),全文.

Also Published As

Publication number Publication date
CN102054107A (en) 2011-05-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant