CN203706197U

CN203706197U - Coarse-granularity dynamic and reconfigurable data regularity control unit structure

Info

Publication number: CN203706197U
Application number: CN201420060846.XU
Authority: CN
Inventors: 葛伟; 曹鹏; 马俊; 刘波; 杨锦江; 徐凯; 杨军; 王超; 卜爱国
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2014-02-10
Filing date: 2014-02-10
Publication date: 2014-07-09
Anticipated expiration: 2024-02-10

Abstract

The utility model discloses a coarse-grained dynamically reconfigurable data regularity control unit structure. Its data stream control module comprises a vector loading module, a vector phase-shifting module and an unpacking distribution module. The three modules form a multi-level composite two-stage pipeline through double buffer registers, and the pipeline is synchronized by hardware handshaking. The vector loading module accesses different data address spaces under dynamically reconfigurable configuration and loads data from memory into a vector data register file. The vector phase-shifting module performs operations such as shifting and splicing on the data stream under dynamically reconfigurable configuration and writes the output data to a vector phase-shifting register file. The unpacking distribution module distributes the register data according to its configuration, meeting the array's requirement for concurrent input of computation data. The structure effectively solves the problems of non-aligned access and data regularity during data loading.

Description

A coarse-grained dynamically reconfigurable data regularity control unit structure

Technical field

The utility model relates to a coarse-grained dynamically reconfigurable data regularity control unit structure and belongs to the field of embedded reconfigurable design.

Background technology

Reconfigurable computing is a computing paradigm that combines the flexibility of software with the efficiency of hardware; the field-programmable gate array (FPGA) is one concrete instance of it. It differs from an ordinary microprocessor in that it can change not only the control flow but also the structure of the data path, and it offers high performance, low hardware overhead and power consumption, good flexibility and good scalability. At present it is mainly applied to computation-intensive algorithms such as media processing, pattern recognition and baseband processing. As embedded processors are generally required to shorten design cycles and reduce design and development costs, and as the uncertainty of the final market and technology grows, reconfigurable processing has begun to emerge as a development trend for embedded processors. It is also relevant to many high-performance computing fields, including structural analysis, computational fluid dynamics, molecular simulation, bioinformatics, chemistry, seismic geology (oil and gas exploration), numerical weather prediction and cosmology research.

As the demands of software applications grow, so do the performance requirements on reconfigurable systems. Data transfer in reconfigurable computing likewise faces many challenges: besides large data access volumes, it must cope with the performance loss caused by inefficient memory access. Apart from the inherent access latency of memory, the way data are laid out in memory and the pattern in which they are accessed also strongly affect transfer efficiency. Data transfer therefore faces the problems of non-aligned access and data regularity.

With a traditional general-purpose processor, data structures can be padded automatically at compile time, and warnings are issued for risky operations that may cause such problems. A non-aligned memory word operation tends to raise a hardware exception, or is converted into two read operations in the microcode of the general-purpose processor.

Single-instruction multiple-data (SIMD) processing can access multiple data elements concurrently when addresses are aligned; when addresses are not aligned, data regularization and splicing are needed to obtain the required data structure. Although concurrent access significantly improves data bandwidth, it increases programming complexity, so usually only the core computation of an application is rewritten as SIMD code.

Application-specific integrated circuits (ASICs) are highly efficient at implementing a specific data access behavior. An ASIC implementation can perform data shifting and memory access at the same time, improving application performance while increasing memory access efficiency. However, a dedicated design for a specific algorithm is not only complex to design but also limits the applicability of the ASIC.

Existing research on reconfigurable architectures uses a variety of design methods to meet the data regularity requirements of data streams. To allow flexible data storage, a traditional coarse-grained reconfigurable architecture lets the reconfigurable computing units access data outside the array explicitly, and schedules memory accesses over a multi-mode storage structure to satisfy the demand for computation data. Such a design simplifies the routing paths of the data, but the reconfigurable computing units that perform the accesses also occupy computing resources; in particular, their data accesses can stall the computation pipeline of the whole array and limit computing performance. Although heuristic data prefetching and reuse can effectively hide memory access overhead, the approach is still limited by data parallelism and cannot exploit data dependences between computations to achieve better data-parallel execution performance.

Summary of the utility model

Purpose: to overcome the deficiencies of the prior art, the utility model provides a coarse-grained dynamically reconfigurable data regularity control unit structure. It explores local registers for memory access based on prefetching and reuse, studies the reconfigurable array's demand for discrete, irregular computation data, and proposes a data stream regularity unit based on a vector register file design, thereby resolving the bottleneck in the reconfigurable memory access data path.

Technical solution: to achieve the above purpose, the utility model adopts the following technical solution:

A coarse-grained dynamically reconfigurable data regularity control unit structure comprises a data stream control module implemented through hardware connections. The data stream control module comprises three modules: a vector loading module, a vector phase-shifting module and an unpacking distribution module. The three modules form a multi-level composite two-stage pipeline through a vector data register file and a vector phase-shifting register file, and the pipeline is synchronized by hardware handshaking. Specifically:

The vector loading module: accesses different data address spaces and loads data from memory into the vector data register file;

The vector phase-shifting module: performs operations such as shifting and splicing on the data in the vector data register file, then writes the output data to the vector phase-shifting register file;

The unpacking distribution module: distributes the data in the vector phase-shifting register file to meet the reconfigurable array's demand for concurrent input of computation data.

Preferably, the vector loading module mainly implements memory access control for different storage spaces and non-aligned loading of the data stream. It comprises memory access control logic, a concurrent memory access state machine, internal storage access control, external storage access control and data selection:

The memory access control logic: decodes the dynamic reconfiguration configuration information and then sends control commands to the concurrent memory access state machine;

The concurrent memory access state machine: uses the independent data paths of the internal and external memories to process internal and external access requests in parallel, reducing the time spent waiting for data;

The internal storage access control and external storage access control: besides issuing protocol-compliant access operations to the system bus, they also split each non-aligned access request into multiple address-aligned access operations;

The data selection: data returned from the system bus are shifted and spliced in data selection and then written to the vector data register file.

Preferably, the vector phase-shifting module contains a phase-shift processing cluster similar to an extended single-instruction multiple-data (SIMD) structure, which processes multiple different registers in parallel and thereby multiplies the data processing capability.

Preferably, the phase-shift processing cluster contains phase-shift processing units. A phase-shift processing unit is structured like a reconfigurable computing unit, except that its processing core is not an arithmetic logic unit (ALU) for data computation but an enhanced phase-shifting unit. In addition, its data width is much larger than that of a reconfigurable computing unit: a 128-bit design is adopted. Furthermore, because a traditional SIMD structure is adopted, the data inputs need not consider direct connections between neighboring processing units, only inputs from the different register files.

Preferably, the unpacking distribution module loads the data processed by the vector phase-shifting module onto the data interface of the reconfigurable array, and mainly implements two functions:

Data alignment: because the data width of the reconfigurable computing units may not match the computation width of the application, the unpacking distribution module must align the data, which includes shifting and padding operations;

Data routing and distribution: the unpacking distribution units in the unpacking distribution module establish a one-to-one correspondence between each vector phase-shifting register in the vector phase-shifting register file and each row of reconfigurable computing units.

It should be understood that the parts involved in this application and their functions are all implemented through connections of registers or register files.

Beneficial effects: in the coarse-grained dynamically reconfigurable data regularity control unit structure provided by the utility model, the vector loading module, the vector phase-shifting module and the unpacking distribution module form a multi-level composite two-stage pipeline through double buffer registers, and the pipeline is synchronized by hardware handshaking, which effectively solves the problems of non-aligned access and data regularity during data loading. Compared with a traditional reconfigurable memory access design based on explicit access, the data stream regularity unit design improves computing performance by a factor of 3.34 on average.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the utility model;

Fig. 2 is a structural schematic diagram of the vector loading module;

Fig. 3 is a state transition diagram of the concurrent memory access state machine;

Fig. 4 is a structural schematic diagram of the vector phase-shifting module;

Fig. 5 is a structural schematic diagram of the phase-shift processing unit;

Fig. 6 is a structural schematic diagram of the unpacking distribution module.

Detailed description of the embodiments

The utility model is further described below with reference to the accompanying drawings.

A coarse-grained dynamically reconfigurable data regularity control unit structure comprises a data stream control module implemented through hardware connections. As shown in Fig. 1, the data stream control module comprises three modules: a vector loading module, a vector phase-shifting module and an unpacking distribution module. The three modules form a multi-level composite two-stage pipeline through double buffer registers, and the pipeline is synchronized by hardware handshaking.

The vector loading module: accesses different data address spaces under dynamic reconfiguration configuration and loads data from memory into the vector data register file; its operation is shown in Fig. 2;

The vector phase-shifting module: under dynamically reconfigurable configuration, performs operations such as shifting and splicing on the data in the vector data register file and writes the output data to the vector phase-shifting register file; its working mechanism is shown in Fig. 4;

The unpacking distribution module: distributes the data in the vector phase-shifting register file under dynamic reconfiguration configuration to meet the reconfigurable array's demand for concurrent input of computation data; its workflow is shown in Fig. 6.
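As a rough illustration of how three stages decoupled by double buffer registers and a handshake might behave, the following Python sketch models each register file as a two-entry buffer with ready/valid signals. The names `DoubleBuffer`, `run_pipeline` and `stall_cycles` are hypothetical, and the cycle-level behavior is an assumption made for illustration; the utility model does not specify the handshake at this level of detail.

```python
class DoubleBuffer:
    """Two-entry buffer standing in for a double buffer register (assumed model)."""

    def __init__(self):
        self.slots = [None, None]
        self.write_idx = 0
        self.read_idx = 0
        self.count = 0

    def ready(self):
        """Handshake towards the producer: a free slot exists."""
        return self.count < 2

    def valid(self):
        """Handshake towards the consumer: data is available."""
        return self.count > 0

    def push(self, value):
        assert self.ready()
        self.slots[self.write_idx] = value
        self.write_idx ^= 1          # alternate between the two slots
        self.count += 1

    def pop(self):
        assert self.valid()
        value = self.slots[self.read_idx]
        self.read_idx ^= 1
        self.count -= 1
        return value


def run_pipeline(items, stall_cycles=frozenset()):
    """Move items through load -> phase shift -> unpack, one step per cycle.

    stall_cycles lists cycles in which the unpack stage is busy (hypothetical).
    """
    data_rf, shift_rf = DoubleBuffer(), DoubleBuffer()
    pending, delivered, cycle = list(items), [], 0
    while pending or data_rf.valid() or shift_rf.valid():
        cycle += 1
        # Downstream stages are evaluated first, so a value advances one stage per cycle.
        if shift_rf.valid() and cycle not in stall_cycles:
            delivered.append(shift_rf.pop())          # unpack stage consumes
        if data_rf.valid() and shift_rf.ready():
            shift_rf.push(data_rf.pop())              # phase-shift stage
        if pending and data_rf.ready():
            data_rf.push(pending.pop(0))              # load stage
    return delivered, cycle


print(run_pipeline([1, 2, 3]))
print(run_pipeline([1, 2, 3, 4], stall_cycles={3, 4}))
```

In this model a stalled consumer lets both buffers fill, after which the upstream stages hold their data until a slot frees up, and ordering is preserved throughout.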

The vector loading module is connected to the system through two standard AMBA AHB 2.0 buses of different widths, which connect to the on-chip memory and to the off-chip memory controller respectively, so that the different data transfer requirements of the reconfigurable array can be met. The memory access control logic decodes the dynamic reconfiguration configuration information and sends control commands to the concurrent memory access state machine. Depending on the address space, the concurrent memory access state machine directs the internal storage access control or the external storage access control to issue access operations to the system bus. Data returned from the system bus are shifted and spliced in data selection and then written to the vector register file. The internal storage access control and external storage access control each maintain their own bus access protocol control for interacting with the external interface. The operation of the concurrent memory access state machine is analyzed below with reference to Fig. 3.

The concurrent memory access state machine uses the independent data paths of the internal and external memories to process internal and external access requests in parallel, reducing the time spent waiting for data. As shown in Fig. 3, EI denotes a valid internal access request, EE a valid external access request, VI a valid return of internally requested data, and VE a valid return of externally requested data. The state machine changes state according to the request signals, with on-chip requests given priority over off-chip requests. When EI is valid, the state machine moves from the IDLE state to the INTERNAL state regardless of whether EE is valid. From there, if the EE signal is valid, the state machine moves to the BOTH state and processes internal and external data accesses simultaneously; otherwise, if the EE signal is invalid, the state machine returns to the IDLE state once the internal access completes and the VI signal becomes valid. Only when the EE signal alone is valid and the EI signal is invalid does the state machine enter the EXTERNAL state from the IDLE state. From there, if the EI signal becomes valid, the state machine moves to the BOTH state. In the BOTH state, the state machine returns to either the INTERNAL state or the EXTERNAL state, depending on the order in which the data accesses complete. When there are multiple accesses to the same type of storage space, for example repeated accesses to the on-chip memory, the state machine only switches between the IDLE and INTERNAL states and degenerates into first-order memory access control logic.
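The transitions described for Fig. 3 can be summarized in a small software model. This is only a sketch: the function name `next_state` is hypothetical, and three transitions are assumptions because the text does not state them explicitly, namely EXTERNAL returning to IDLE on VE, BOTH moving to the state of the access that is still outstanding, and BOTH returning to IDLE when both accesses complete in the same cycle.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    INTERNAL = "INTERNAL"
    EXTERNAL = "EXTERNAL"
    BOTH = "BOTH"


def next_state(state, ei=False, ee=False, vi=False, ve=False):
    """Return the next state given request (EI, EE) and data-return (VI, VE) signals."""
    if state is State.IDLE:
        if ei:                       # on-chip request has priority, EE is ignored
            return State.INTERNAL
        if ee:                       # only EE valid
            return State.EXTERNAL
        return State.IDLE
    if state is State.INTERNAL:
        if ee:                       # external request arrives: serve both
            return State.BOTH
        if vi:                       # internal access finished
            return State.IDLE
        return State.INTERNAL
    if state is State.EXTERNAL:
        if ei:
            return State.BOTH
        if ve:                       # assumption: symmetric to INTERNAL -> IDLE
            return State.IDLE
        return State.EXTERNAL
    # state is BOTH
    if vi and ve:                    # assumption: both finished together
        return State.IDLE
    if vi:                           # assumption: external access still outstanding
        return State.EXTERNAL
    if ve:                           # assumption: internal access still outstanding
        return State.INTERNAL
    return State.BOTH


# Trace: both requests raised at once, internal data returns first.
s = State.IDLE
for signals in ({"ei": True, "ee": True}, {"ee": True}, {"vi": True}, {"ve": True}):
    s = next_state(s, **signals)
    print(s.name)
```

The trace visits INTERNAL, BOTH, EXTERNAL and IDLE in turn, which matches the priority rule that an on-chip request is taken first even when an off-chip request is raised in the same cycle.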

To support non-aligned access, both the access requests and the returned data must be processed. Because transfers on the system bus must be address-aligned, the internal storage access control and external storage access control, besides issuing protocol-compliant access operations to the bus, must also split each non-aligned access request into multiple address-aligned access operations. The multiple data items returned must then be spliced to obtain the vector data of the non-aligned access. The splicing is done in data selection. For example, when the vector loading module starts loading from a start address whose low-order bits are 0x3, the external bus accesses data in 32-bit-aligned Burst4 mode, so the data from the two bus accesses at 0x0 and 0x4 must be shifted and spliced to obtain the required data.
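The 0x3 example can be worked through in a few lines of Python. This is a simplified sketch under stated assumptions: memory is modeled as a byte string, each bus access returns one aligned 32-bit word (burst behavior is not modeled), and the helper names `aligned_read`, `split_unaligned` and `unaligned_load` are hypothetical.

```python
WORD_BYTES = 4  # 32-bit aligned bus transfers, as in the example in the text


def aligned_read(memory: bytes, addr: int) -> bytes:
    """Model one address-aligned bus access that returns a full 32-bit word."""
    assert addr % WORD_BYTES == 0, "bus accesses must be word aligned"
    return memory[addr:addr + WORD_BYTES]


def split_unaligned(addr: int, length: int) -> list[int]:
    """Return the aligned word addresses that cover [addr, addr + length)."""
    first = addr - addr % WORD_BYTES
    last = (addr + length - 1) // WORD_BYTES * WORD_BYTES
    return list(range(first, last + WORD_BYTES, WORD_BYTES))


def unaligned_load(memory: bytes, addr: int, length: int) -> bytes:
    """Splice the aligned words, then shift out the unwanted leading bytes."""
    words = b"".join(aligned_read(memory, a) for a in split_unaligned(addr, length))
    offset = addr % WORD_BYTES
    return words[offset:offset + length]


memory = bytes(range(16))                      # byte i holds the value i
print(split_unaligned(0x3, 4))                 # [0, 4]
print(unaligned_load(memory, 0x3, 4).hex())    # 03040506
```

A four-byte load starting at 0x3 therefore needs the two aligned accesses at 0x0 and 0x4, and the result is formed from the last byte of the first word and the first three bytes of the second.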

As shown in Fig. 4, the vector phase-shifting module contains a phase-shift processing cluster similar to an extended single-instruction multiple-data (SIMD) structure, which processes multiple different registers in parallel and thereby multiplies the data processing capability.

In this application, the data stream control module contains two independent phase-shift processing clusters, which together can process up to 8 threads simultaneously. When only half of that computing capacity is needed, the second phase-shift processing cluster can be shut down, reducing the loading of phase-shift commands and lowering system power consumption. A single phase-shift processing cluster contains 4 phase-shift processing units, whose functions are specially optimized for vector phase-shift instructions and whose data width matches that of the double buffer registers.

As depicted in Fig. 4, the instructions in the phase-shift command queue are decoded and then distributed in turn to the 4 phase-shift processing units, and each phase-shift processing unit computes according to its parsed command. Data processing in the vector phase-shifting module can complete in as little as 1 clock cycle; the maximum processing time depends on the length of the command queue and is constrained by the phase-shift requirements and the volume of the data.
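One way to picture the dispatch of the command queue is a round-robin assignment to the 4 units of a cluster. The sketch below is illustrative only: the names `dispatch` and `cycles_needed` are hypothetical, and the assumption that every command occupies a unit for exactly one clock cycle is not taken from the text.

```python
UNITS_PER_CLUSTER = 4  # one phase-shift processing cluster holds 4 units


def dispatch(commands, units=UNITS_PER_CLUSTER):
    """Distribute decoded commands to the units in turn (round-robin, assumed)."""
    lanes = [[] for _ in range(units)]
    for index, command in enumerate(commands):
        lanes[index % units].append(command)
    return lanes


def cycles_needed(commands, units=UNITS_PER_CLUSTER):
    """Cycles to drain the queue, assuming one command per unit per cycle."""
    return max(len(lane) for lane in dispatch(commands, units))


queue = ["shift_a", "shift_b", "splice_c", "shift_d", "splice_e"]
print(dispatch(queue))
print(cycles_needed(queue))
```

Under these assumptions a queue of up to 4 commands finishes in a single cycle, consistent with the 1-cycle minimum, and longer queues take proportionally more cycles.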

As depicted in Fig. 5, the input data of a phase-shift processing unit can come from either the vector data register file or the vector phase-shifting register file, as chosen by a selection signal, and the output data are written to the vector phase-shifting register file. Data from the vector data register file enter the different ports of the phase-shift processing unit through data selection; the computed output is finally written to a vector phase-shifting register, and a flag signal is provided. The vector phase-shifting registers also serve as registers for holding temporary data during computation.
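A shift-and-splice operation on two 128-bit inputs can be sketched as follows. The code is a minimal model under stated assumptions: registers are Python integers in little-endian byte order, the operation concatenates two source registers and extracts 128 bits at a byte offset, and the names `shift_splice` and `PhaseShiftUnit` as well as the `(file, index)` source encoding are hypothetical rather than taken from the utility model.

```python
VECTOR_BITS = 128                       # data width stated for the phase-shift unit
VECTOR_MASK = (1 << VECTOR_BITS) - 1


def shift_splice(low: int, high: int, byte_offset: int) -> int:
    """Concatenate high:low and return the 128 bits starting byte_offset bytes into low."""
    if not 0 <= byte_offset <= VECTOR_BITS // 8:
        raise ValueError("byte_offset must be between 0 and 16")
    combined = (high << VECTOR_BITS) | low
    return (combined >> (8 * byte_offset)) & VECTOR_MASK


class PhaseShiftUnit:
    """Selects inputs from either register file and writes to the phase-shift file."""

    def __init__(self, data_rf: list[int], shift_rf: list[int]):
        self.files = {"data": data_rf, "shift": shift_rf}

    def execute(self, src_low, src_high, byte_offset, dst) -> bool:
        low = self.files[src_low[0]][src_low[1]]       # input selection
        high = self.files[src_high[0]][src_high[1]]
        self.files["shift"][dst] = shift_splice(low, high, byte_offset)
        return True                                    # stands in for the flag signal


data_rf = [int.from_bytes(bytes(range(0, 16)), "little"),
           int.from_bytes(bytes(range(16, 32)), "little")]
shift_rf = [0, 0]
unit = PhaseShiftUnit(data_rf, shift_rf)
done = unit.execute(("data", 0), ("data", 1), 3, dst=0)
print(shift_rf[0].to_bytes(16, "little").hex())        # bytes 0x03 through 0x12
```

Because a source can also name the phase-shift register file, an intermediate result written there can be fed back as an input, which mirrors the use of those registers for temporary data.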

The unpacking distribution module maps the processed data in the double buffer registers, row by row, onto the data interface of each row of the reconfigurable computing array. Following the flow shown in Fig. 6, it first shifts the input data according to the number of operands and the source operand size given in the unpacking distribution command; it then pads the data according to the configured target operand size; finally, it distributes the processed data to the data interface of the computing array. Because each reconfigurable processing unit has two data input ports, each row of an 8x8 reconfigurable array has 16 data input ports, so the maximum of 16 8-bit data items that a register can hold can be aligned and input to the array at the same time.
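The shift, pad and distribute steps can be illustrated with a short Python model. It is a sketch under assumptions: a register is a 128-bit integer with operand 0 in the least significant bits, padding is zero extension unless the hypothetical `signed` flag requests sign extension, and the function names `pad_operand` and `unpack_distribute` are invented for this example.

```python
REGISTER_BITS = 128   # one register holds at most 16 operands of 8 bits
PORTS_PER_ROW = 16    # 8 units per row, 2 data input ports each


def pad_operand(value, src_bits, dst_bits, signed=False):
    """Extend a src_bits-wide value to dst_bits (zero or sign extension, assumed)."""
    if signed and (value >> (src_bits - 1)) & 1:
        value |= ((1 << dst_bits) - 1) & ~((1 << src_bits) - 1)
    return value


def unpack_distribute(register, count, src_bits, dst_bits, signed=False):
    """Return the value delivered to each of the first `count` row input ports."""
    if count > PORTS_PER_ROW or count * src_bits > REGISTER_BITS:
        raise ValueError("operands do not fit the register or the row ports")
    if dst_bits < src_bits:
        raise ValueError("target operand must be at least as wide as the source")
    mask = (1 << src_bits) - 1
    ports = []
    for i in range(count):
        operand = (register >> (i * src_bits)) & mask                     # shift step
        ports.append(pad_operand(operand, src_bits, dst_bits, signed))    # padding step
    return ports                                                          # distribution


packed = int.from_bytes(bytes(range(1, 17)), "little")    # sixteen 8-bit operands
print(unpack_distribute(packed, 16, 8, 16))
print(hex(unpack_distribute(0x80, 1, 8, 16, signed=True)[0]))   # 0xff80
```

With sixteen 8-bit operands the register is fully used and all 16 ports of a row receive data in the same step, which is the maximum case described in the text.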

The above is only a preferred embodiment of the utility model. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principle of the utility model, and such improvements and modifications shall also be regarded as falling within the protection scope of the utility model.

Claims

1. A coarse-grained dynamically reconfigurable data regularity control unit structure, characterized in that: it comprises a data stream control module; the data stream control module comprises three modules, namely a vector loading module, a vector phase-shifting module and an unpacking distribution module; the three modules form a multi-level composite two-stage pipeline through a vector data register file and a vector phase-shifting register file, and the pipeline is synchronized by hardware handshaking; specifically:

The vector phase-shifting module: performs shifting and splicing operations on the data in the vector data register file, then writes the output data to the vector phase-shifting register file;

The unpacking distribution module: distributes the data in the vector phase-shifting register file.

2. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 1, characterized in that: the vector loading module comprises memory access control logic, a concurrent memory access state machine, internal storage access control, external storage access control and data selection:

The concurrent memory access state machine: uses the independent data paths of the internal and external memories to process internal and external access requests in parallel;

The data selection: data returned from the system bus are shifted and spliced in data selection and then written to the vector data register file.

3. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 1, characterized in that: the vector phase-shifting module contains a phase-shift processing cluster that processes multiple different registers in parallel.

4. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 3, characterized in that: the phase-shift processing cluster contains phase-shift processing units, and the processing core of each phase-shift processing unit is an enhanced phase-shifting unit.