Background art
Configurable computing, also called adaptive computing, is a data processing structure built from multiple processing elements (PEs) with reconfigurable functions, together with interconnect lines that can be configured for different data transfer directions. At present, configurable computing falls into two main classes: fine-grained and coarse-grained. Fine-grained configurable computing mainly refers to the field-programmable gate array (FPGA), which is widely used in digital chip development and system design. Because of its fine granularity, however, the actual logic cells account for only 10% of the chip area, the rest being occupied by switches, RAM and the routing network. Power consumption and operating frequency are not ideal, and efficiency is low for regular operations such as multiplication. In contrast, coarse-grained configurable computing generally has a granularity of 8, 16 or 32 bits and is well suited to algorithm-level operations.
Since the 1990s, with the development of very-large-scale integration (VLSI) technology, coarse-grained configurable computing structures based on programmable switches have been developed continuously and have shown outstanding performance and potential in image filtering, feature extraction, target recognition and tracking, communication algorithms and similar areas. Table 1 summarizes some of the coarse-grained configurable computing projects at overseas universities.
Table 1 Summary of existing projects
Project name | Structure | Granularity | Fabric | Application target
PADDI | Crossbar | 16 | Crossbar | DSP
PADDI-2 | Crossbar | 16 | Crossbar | DSP
KressArray | 2-D mesh | 32 | NN & segmented bus | Adaptive
RaPiD | 1-D array | 16 | Segmented bus | Pipelining
Matrix | 2-D mesh | 8 | 8NN, global lines | General purpose
RAW | 2-D mesh | 32 | 8NN, switched connections | General purpose
GarP | 2-D mesh | 2 | Global & semi-global lines | Loop acceleration
Pleiades | Mesh/crossbar | Multi-granularity | Segmented crossbar | Multimedia
PipeRench | 1-D array | 128 | | Pipelining
REMARC | 2-D mesh | 16 | NN, full-length bus | Multimedia
MorphoSys | 2-D mesh | 16 | NN, length-2 & length-3 line segments | Multimedia
CHESS | Hexagonal mesh | 4 | 8NN, bus | Multimedia
DreAM | 2-D array | 8 & 16 bits | NN, segmented bus | Wireless communication
CS2000 | 2-D array | 16 & 32 bits | Heterogeneous array | Communication
Although the internal reconfigurable processing elements of these projects differ in structure and function, the projects can be roughly divided into two classes according to the data input/output structure of the processing element array:
(1) A global bus controller or crossbar connects the reconfigurable PE array to the external input/output ports. In this approach, the structure allows the data bus of a particular processing element to be connected to an external port according to the configuration required. Structures of this kind mainly include PADDI-1, PADDI-2, REMARC, COLT, KressArray and PipeRench.
(2) The data input/output buses of the reconfigurable processing elements themselves are connected directly to the external interface. Structures of this kind mainly include GarP, RAW, MorphoSys, CHESS and RaPiD.
Both implementations have shortcomings. In the first, under the control of the global bus controller or crossbar, an external port is connected to the data bus of one processing element in the PE array, and the data of the external port are input and output through that element. This input/output structure cannot handle cases such as the following: the serially input data are divided in order into groups of three (a, b, c), and the PE array maps the function (a*b+c). In the second, because the data of the processing elements are brought out directly, the bit width required at the external interface grows with the number of processing elements driving output data buses and with the bit width of those buses. Taking the 16-bit MorphoSys as an example, the interface between its 8*8 PE array and the microprocessor needs 256 bits; if the PE array is changed to 12*12, the interface to the microprocessor needs 384 bits, which increases both the chip area occupied by the interface and the control complexity.
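The two interface widths quoted for MorphoSys can be reproduced with a small calculation. The sketch below is only an illustration: the function name `interface_bits` and the factor `edge_rows = 2` are assumptions chosen because they fit both quoted figures (256 and 384 bits); the text does not state how the figures were derived.

```python
def interface_bits(side: int, word_bits: int = 16, edge_rows: int = 2) -> int:
    """Bits needed when PE data buses are brought out directly.

    Assumption (not stated in the text): `edge_rows` rows of `side`
    processing elements each expose one `word_bits`-wide bus.
    """
    return edge_rows * side * word_bits


if __name__ == "__main__":
    for side in (8, 12):
        print(f"{side}*{side} array: {interface_bits(side)} bits")
```

Under this assumption the width grows linearly with the array side, which is the scaling problem the second approach suffers from.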
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a data input/output structure for a coarse-grained reconfigurable computing structure that both facilitates the input and output of data for mapped algorithms and reduces the number of input/output ports.
Technical solution of the invention: the data input/output structure in a coarse-grained reconfigurable computing structure comprises a distribution (DEMUX) module, a merge (MERGE) module, a RAM function controller, an address generator, a RAM, and crossbars C1, C2 and C3. Crossbar C1 selects the input data bus for the DEMUX module from among the input data bus, the adjacent-PEA data input and the RAM input data. The data selected by C1 enter the DEMUX module, which distributes them to the processing element array (PEA) for processing. The multiple data buses output by the PEA are merged by the MERGE module into a single data bus. These data are either sent to the output data bus under the control of crossbar C2, or written to the RAM or output to the adjacent PEA under the joint control of crossbars C2 and C3. The address generator and the RAM function controller produce the address information and control signals needed to carry out data operations in the RAM correctly. Data read from the RAM can, under the control of crossbar C3, be passed to crossbar C1 or output to the adjacent PEA.
Compared with the prior art, the invention has the following advantages:
(1) Because the structure has merge and distribution modules, the data of N+1 data buses are combined onto one data bus for input and output, which markedly reduces the number of external ports.
(2) After passing through the distribution module, sequentially input data can be distributed to different processing elements that start processing at the same time. This supports cases such as the following: the input data are divided into groups of three (a, b, c), and the PE array maps the function (a*b+c).
(3) After the data output by the PEA pass through the merge module, multiple input data items can be output in sequence on one output data bus, which is convenient for RAM operations.
(4) The invention uses data transfer paths controlled by crossbars, which is convenient and flexible. It supports not only data input to and output from the PE array but also data transfer between PE arrays. In addition, a data path into the PE array can be set up for data read from the RAM in the interface structure.
(5) The address generator integrates the DMA function into the interface structure, which speeds up data input and output, benefits the processing of data streams, and lets configurable computing show its full advantage.
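The (a, b, c) case in advantage (2) can be illustrated with a short behavioural sketch. It is not the hardware described here; the names `scatter_groups` and `mac_results` are hypothetical, and the model only shows serial words being gathered into groups of three and then handed to the array together so that a*b+c can be evaluated.

```python
from typing import Iterable, Iterator, List, Tuple


def scatter_groups(stream: Iterable[int], width: int = 3) -> Iterator[Tuple[int, ...]]:
    """Collect `width` serially arriving words, then release them together."""
    group: List[int] = []
    for word in stream:
        group.append(word)
        if len(group) == width:
            yield tuple(group)  # all words of the group become available at once
            group = []


def mac_results(stream: Iterable[int]) -> List[int]:
    """Evaluate a*b+c for each group of three words in the serial stream."""
    return [a * b + c for a, b, c in scatter_groups(stream, 3)]


if __name__ == "__main__":
    print(mac_results([2, 3, 4, 5, 6, 7]))
```

An incomplete trailing group is simply never released in this sketch; how the real module treats a partial group is not described in the text.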
Embodiment
As shown in Figure 1, the invention mainly consists of the distribution (DEMUX) module, the merge (MERGE) module, the RAM function controller, the address generator, the RAM, and crossbars C1, C2 and C3. The external interface signal lines in Figure 1 are defined as follows:
● 'adjacent PEA data input': the data bus from the adjacent PEA into this I/O interface structure;
● 'output data to adjacent PEA': the data bus from this I/O interface structure out to the adjacent PEA;
● 'output data bus': the data bus from this interface structure to the outside;
● 'input data bus': the data bus from the outside into this interface structure;
● 'address bus': the address bus required when this interface structure reads the external data memory (SRAM);
● RD, WR: the control signals with which this input/output structure reads and writes external RAM data.
Crossbars C1, C2 and C3 control the data transfer paths, selecting the appropriate input data bus and output data direction for each functional module. Crossbar C1 selects the input data bus for the DEMUX module from among the input data bus, the adjacent-PEA data input and the RAM input data. The data selected by C1 enter the DEMUX module, which distributes them to the processing element array (PEA) for processing. The multiple data buses output by the PEA are merged by the MERGE module into a single data bus. These data are either output under the control of crossbar C2 or, through crossbar C3, written to the RAM or output to the adjacent PEA. The address generator and the RAM function controller produce the address information and control signals needed to carry out data operations in the RAM correctly. The RAM data operations are mainly DMA read, DMA write, and read at an input address; the RAM control signals are the data write control signal WR and the data read control signal RD. Data read from the RAM are, under the control of crossbar C3, either fed back to crossbar C1, where they become the RAM input data, or output to the adjacent PEA.
As shown in Figure 2, the DEMUX module of the invention consists of a data counter, a configuration code register, a data storage location table, N+1 decoders, N+1 data registers, a maximum count value register and routing gate circuits. The configuration code register holds configuration data stored in advance, divided into two parts: the high (m+1) bits form the first part, and the remaining 4×(N+1) bits, taken in groups of 4 bits, form the second part. N and m are related by m = [log2(N+1)], where [ ] denotes rounding to an integer. The data storage location table consists of N+1 internal registers whose contents give the storage locations of the input data. The m-bit data counter counts the valid input data. The maximum count value register stores the number of data buses on which the distribution module outputs. Under the data output trigger signal, data register x (x = 0, 1, ..., N) outputs its stored data on output data bus x (x = 0, 1, ..., N).
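The two-part layout of the configuration code can be sketched as a bit-field split. This is an illustration under stated assumptions: the text does not say which 4-bit group belongs to which table entry, nor whether [ ] rounds up or down, so the sketch assumes that group 0 sits in the lowest bits and that m is rounded up. The function name `split_config` is hypothetical.

```python
from typing import List, Tuple


def split_config(word: int, n_plus_1: int) -> Tuple[int, List[int]]:
    """Split a configuration code into (maximum count value, location table).

    Assumed layout: [ high (m+1) bits | 4 bits x (N+1) entries ],
    with entry 0 in the least significant 4 bits.
    """
    m = (n_plus_1 - 1).bit_length()      # assumption: m = ceil(log2(N+1))
    low_bits = 4 * n_plus_1
    max_count = word >> low_bits         # first part, feeds the maximum count value register
    if max_count >= 1 << (m + 1):
        raise ValueError("first part does not fit in m+1 bits")
    # second part, feeds the data storage location table
    table = [(word >> (4 * i)) & 0xF for i in range(n_plus_1)]
    return max_count, table


if __name__ == "__main__":
    # N+1 = 4 buses: count value 4, identity mapping of positions to registers.
    print(split_config((4 << 16) | 0x3210, 4))
```

With N+1 = 4, m is 2, so the first part is 3 bits wide, which is just enough to hold the count value 4.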
The first and second parts of the configuration code register are connected to the maximum count value register and the data storage location table, respectively. External input data are fed both to the m-bit data counter and to the routing gate switch of each data register. The m-bit data counter passes the recorded number of input data items to the data storage location table and the maximum count value register at the same time. The location table reads the storage location held in the internal register designated by that count and sends it to the decoder of each routing gate circuit; the decoders control the routing gate circuits. The m-bit data counter is also connected to the maximum count value register: once the recorded number of input data items equals the value in the maximum count value register, a control signal is produced. This signal serves both as the reset signal of the m-bit data counter and as the data output trigger signal that makes data register x (x = 0, 1, ..., N) output its data.
As shown in Figure 3, the DEMUX module of the invention receives 4 input data items serially and then outputs the 4 received items in parallel, at the same time, on output data buses 1, 2, 3 and 4.
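The counter, location table and maximum count value register described above can be modelled behaviourally as follows. This is a minimal sketch, not a cycle-accurate model: the class name `Demux` and its method `push` are hypothetical, and the exact moment at which the counter is compared with the maximum count value is an assumption.

```python
from typing import List, Optional, Sequence


class Demux:
    """Serial-in, parallel-out model of the distribution module."""

    def __init__(self, location_table: Sequence[int], max_count: int) -> None:
        self.location_table = list(location_table)  # count value -> data register index
        self.max_count = max_count                  # maximum count value register
        self.counter = 0                            # data counter
        self.registers: List[Optional[int]] = [None] * len(self.location_table)

    def push(self, word: int) -> Optional[List[Optional[int]]]:
        """Accept one serial word; return all registers once the group is complete."""
        # The location table entry selects which data register is gated open.
        self.registers[self.location_table[self.counter]] = word
        self.counter += 1
        if self.counter == self.max_count:
            self.counter = 0              # the control signal resets the counter ...
            return list(self.registers)   # ... and triggers the parallel output
        return None


if __name__ == "__main__":
    demux = Demux(location_table=[0, 1, 2, 3], max_count=4)
    for w in (10, 11, 12, 13):
        print(demux.push(w))
```

With the identity table this mirrors the Figure 3 case of four serial words appearing together on four output buses; a permuted table such as [2, 0, 3, 1] would route the same words to different buses.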
As shown in Figure 4, the MERGE module of the invention mainly consists of a valid-data flag register, a configuration code 1 register, a configuration code 2 register, a data storage location table, a maximum count value register, a counter, a '1' holder, a storage data register, routing gate circuits and a logic comparison circuit. The valid-data flag register has N+1 bits, each corresponding to external input data x (x = 0, 1, ..., N); when an external input is valid, the corresponding bit is set to '1', otherwise it is cleared. The '1' holder is a high-level latch: when triggered by its input signal it holds its output high until a reset signal arrives, then returns to low. The storage data register consists of N+1 internal registers that can store the external input data in parallel at the same time; according to the input address information, it reads the data in the corresponding register and outputs them. The data storage location table consists of N+1 internal registers whose contents indicate the storage locations of the input data. The data counter counts the valid input data. The maximum count value register stores the number of input data buses that the merge module is to merge. The configuration code 1 register holds, stored in advance, the information on which of the input data buses are to be merged for output. It has N+1 bits, each corresponding to external input data x (x = 0, 1, ..., N); a bit set to '1' means that the corresponding external input is to be merged for output. The configuration code 2 register holds configuration data stored in advance, divided into two parts: the high (m+1) bits form the first part, and the remaining 4×(N+1) bits, taken in groups of 4 bits, form the second part. N and m are related by m = [log2(N+1)], as in the DEMUX module.
When valid external data enter the storage data register, the validity information of the data is written to the valid-data flag register. The logic comparison circuit compares it with the data stored in advance in the configuration code 1 register and forms a control signal. When this control signal is valid, it triggers the '1' holder, which enables the counter. The counter counts the clock signal starting from zero; at the same time the routing gate circuits for external input data are cut off, so that the module receives no new external input data. The count value is passed to the data storage location table and the maximum count value register at the same time. The data storage location table reads the location information from the internal register designated by the count value and sends it to the storage data register, which outputs the corresponding data. The counter is connected to the maximum count value register: once the count value equals the value in the maximum count value register, a control signal is produced that returns the '1' holder to its low-level state.
As shown in Figure 5, the MERGE module of the invention receives the data on 4 input data buses in parallel and outputs them serially, in sequence, on the output data bus.
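The merge sequence can be sketched in the same behavioural style. The function name `merge` and its parameters are hypothetical, and the logic comparison is modelled as a simple equality test between the valid-data flags and configuration code 1, which is an assumption about what the comparison circuit does.

```python
from typing import List, Sequence


def merge(words: Sequence[int], valid_flags: Sequence[int],
          select_mask: Sequence[int], location_table: Sequence[int],
          max_count: int) -> List[int]:
    """Parallel-in, serial-out model of the merge module.

    words          : data latched in the storage data register (one per input bus)
    valid_flags    : valid-data flag register, one bit per input bus
    select_mask    : configuration code 1, buses that are to be merged
    location_table : count value -> index of the register to read next
    max_count      : number of buses to merge (maximum count value register)
    """
    if list(valid_flags) != list(select_mask):
        return []  # comparison fails: the '1' holder is not triggered
    # The counter runs from zero; each count selects one stored word for output.
    return [words[location_table[count]] for count in range(max_count)]


if __name__ == "__main__":
    print(merge([10, 11, 12, 13], [1, 1, 1, 1], [1, 1, 1, 1], [0, 1, 2, 3], 4))
```

The list returned corresponds to the order in which words would appear on the single output bus, as in the Figure 5 case of four parallel inputs.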
As shown in Figure 6, the RAM function controller of the invention mainly consists of a counter, a length register, a DMA transfer length configuration memory, a RAM function configuration memory, a function control register, a read/write signal generator and a logic comparison circuit. In DMA mode the counter counts the input clock; this clock signal is also output to the address generator as the increment trigger signal. The RAM function configuration memory is a 3-bit register whose bits correspond one to one with the bits of the function control register and define the working modes the module allows: DMA read, DMA write and general read. The read/write signal generator produces the read and write signals for accessing the external RAM.
The configuration data stored in advance in the DMA transfer length configuration memory and the RAM function configuration memory are written to the length register and the function control register, respectively. The DMA mode control bits of the function control register are logically combined with the clock signal to form a control signal, which is output externally as the increment trigger signal and is also fed to the counter to be counted. The counter value and the length register value are fed to the logic comparison circuit at the same time. If they are equal, a DMA end signal is formed and output; otherwise the read/write signal generator is allowed to produce read and write signals in DMA mode. In general read mode, the valid-data flag formed from the external input data bus is logically combined with the clock signal inside the read/write signal generator to form the externally output RD and WR control signals.
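The DMA counting and end-of-transfer comparison can be sketched as a generator that emits one strobe per clock until the counter reaches the programmed length. The name `dma_strobes` and the active-high encoding of RD and WR are assumptions made for illustration only.

```python
from typing import Iterator, Tuple


def dma_strobes(length: int, write: bool) -> Iterator[Tuple[int, int, int]]:
    """Yield (counter, RD, WR) for each DMA cycle.

    `length` models the length register; `write` selects DMA write over
    DMA read. The loop ends where the comparison circuit would raise the
    DMA end signal (counter equal to length).
    """
    counter = 0
    while counter < length:
        rd, wr = (0, 1) if write else (1, 0)  # assumed active-high strobes
        yield counter, rd, wr
        counter += 1                          # one increment trigger per clock


if __name__ == "__main__":
    for cycle in dma_strobes(3, write=False):
        print(cycle)
```

Each yielded cycle also stands for one increment trigger sent to the address generator, so the number of strobes equals the number of addresses produced.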
As shown in Figure 7, the address generator module of the invention mainly consists of an increment controller, a DMA start address configuration register, a RAM function configuration memory, a function control register and a valid-data flag register. In DMA mode, the increment controller repeatedly adds 1 to the initially input data under the external increment trigger signal and outputs the result on the output address bus; in the normal read/write mode it simply outputs the input data directly as the address. The DMA start address configuration register holds the start address used in DMA mode. The RAM function configuration memory is a 3-bit register whose bits correspond one to one with the bits of the function control register and define the working modes the module allows: DMA read, DMA write and general read.
The data on the external input data bus and the data in the DMA start address configuration register are both fed to an internal multiplexer (MUX), and the data selected by the MUX enter the increment controller. The MUX select signal is derived from two signals: in DMA mode, a control signal formed by combinational logic from the delayed external configuration command; in general read mode, the data-valid signal formed from the external input data bus.
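The two address sources can be contrasted in a small sketch. The class name `AddressGenerator` and its methods are hypothetical; the model assumes the first DMA trigger outputs the start address itself and each later trigger outputs the next address, which the text does not specify.

```python
from typing import Optional


class AddressGenerator:
    """Behavioural model: incrementing addresses in DMA mode, pass-through otherwise."""

    def __init__(self, dma_start: int) -> None:
        self.dma_start = dma_start            # DMA start address configuration register
        self.current: Optional[int] = None    # state of the increment controller

    def dma_trigger(self) -> int:
        """One increment trigger: return the next DMA address."""
        if self.current is None:
            self.current = self.dma_start     # MUX selects the configured start address
        else:
            self.current += 1                 # increment controller adds 1
        return self.current

    def normal(self, address_in: int) -> int:
        """Normal mode: the input data are output directly as the address."""
        return address_in


if __name__ == "__main__":
    gen = AddressGenerator(dma_start=0x20)
    print([hex(gen.dma_trigger()) for _ in range(4)])
    print(gen.normal(7))
```

In DMA mode the processor supplies only the start address and length; in normal mode every address comes from the input data bus.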
In Figure 1, C1, C2 and C3 are crossbars, whose role is to control the data transfer paths according to the configuration information. Figure 8 shows the structure of the crossbar used in the invention. Under the control of these routing switches, data transfer paths such as those shown in Figure 9 can be formed as required.
As shown in Figure 9, INx → DEMUXx and MERGEx → OUTx (x: 1, 2) control data input and output; MERGE1 → RAM1 and MERGE2 → RAM3 are paths from the PE array to RAM; RAM1 → DEMUX2, RAM3 → DEMUX2, RAM1 → DEMUX1 and RAM15 → DEMUX1 are paths from RAM to the PE array; RAM15 → RAM1 and RAM1 → RAM3 are data transfer paths between adjacent RAM blocks. INx → DEMUXx performs the data distribution function: depending on the configuration, the externally input data are split onto multiple buses and enter the PEA at the same time. MERGEx → OUTx performs the data merge function: the data output by the PEA are merged and then output on one data bus. Besides transferring data between RAMs, RAM15 → RAM1 and RAM1 → RAM3 can also implement indirect addressing. Taking RAM15 → RAM1 as an example, the addresses of the data to be read from RAM1 are first stored in order in RAM15; the data in RAM15 are then read in sequence and passed to RAM1 as address information, so that the data output by RAM1 are the data at the desired addresses.
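The indirect addressing use of the RAM15 → RAM1 path amounts to a gather operation, which can be sketched as follows. The function name `indirect_read` and the list-based RAM contents are illustrative assumptions.

```python
from typing import List, Sequence


def indirect_read(index_ram: Sequence[int], data_ram: Sequence[int]) -> List[int]:
    """Read `index_ram` in order and use each word as an address into `data_ram`.

    `index_ram` plays the role of RAM15 (stored addresses) and `data_ram`
    the role of RAM1 (stored data).
    """
    return [data_ram[address] for address in index_ram]


if __name__ == "__main__":
    ram15 = [2, 0, 3]              # addresses prepared in advance
    ram1 = [100, 101, 102, 103]    # data to be fetched
    print(indirect_read(ram15, ram1))
```

The order of the output follows the order in which the addresses were stored in RAM15, so arbitrary access patterns can be prepared ahead of time.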