CN203706197U - Coarse-granularity dynamic and reconfigurable data regularity control unit structure - Google Patents

Coarse-granularity dynamic and reconfigurable data regularity control unit structure

Info

Publication number
CN203706197U
Authority
CN
China
Prior art keywords
data
vector
phase shift
memory access
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201420060846.XU
Other languages
Chinese (zh)
Inventor
葛伟
曹鹏
马俊
刘波
杨锦江
徐凯
杨军
王超
卜爱国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201420060846.XU
Application granted
Publication of CN203706197U
Anticipated expiration
Expired - Fee Related

Abstract

The utility model discloses a coarse-grained dynamically reconfigurable data regularity control unit structure. Its data flow control module comprises a vector loading module, a vector phase-shifting module and an unpacking distribution module. The three modules form a multi-level composite two-stage pipeline through double-buffered registers, and the pipeline is synchronized by hardware handshaking. The vector loading module accesses different data address spaces under dynamically reconfigurable configuration and loads data from memory into a vector data register file. The vector phase-shifting module, also under dynamically reconfigurable configuration, performs operations such as shifting and splicing on the data stream and writes its output to a vector phase-shifting register file. The unpacking distribution module distributes the register data as configured, meeting the array's need for concurrent input of computation data. The structure effectively handles unaligned access and data regularization during data loading.

Description

A coarse-grained dynamically reconfigurable data regularity control unit structure
Technical field
The utility model relates to a coarse-grained dynamically reconfigurable data regularity control unit structure and belongs to the field of embedded reconfigurable design.
Background art
Reconfigurable computing is a computing paradigm that combines the flexibility of software with the efficiency of hardware; the field-programmable gate array (FPGA) is one concrete instance. Unlike an ordinary microprocessor, a reconfigurable system can change not only the control flow but also the structure of the data path, and it offers high performance, low hardware overhead and power consumption, good flexibility and good scalability. It is currently used mainly for computation-intensive algorithms such as media processing, pattern recognition and baseband processing. As embedded processors are generally required to shorten design cycles and reduce design and development cost, and as uncertainty in the final market and technology grows, reconfigurable processing is emerging as a trend in embedded processor development. It is also applied in many high-performance computing fields, including structural analysis, computational fluid dynamics, molecular simulation, bioinformatics, chemistry, seismic geology (oil and gas exploration), numerical weather prediction and cosmology research.
As software applications become more demanding, the performance requirements on reconfigurable systems rise accordingly. Data movement in reconfigurable computing faces many challenges as well: besides the sheer volume of data accesses, it must cope with the performance loss caused by inefficient memory access. Apart from the inherent access latency of the memory, both the layout of data in memory and the access pattern strongly affect transfer efficiency. Data transfer therefore faces the problems of unaligned access and data regularization.
Traditional general-purpose processors can pad data structures automatically at compile time and warn about risky operations that may cause such problems. Unaligned memory word operations often trigger a hardware exception, or are converted into two read operations in the processor's microcode.
Single instruction, multiple data (SIMD) processing can access multiple data items concurrently when addresses are aligned; when addresses are not aligned, the data must be regularized and spliced to obtain the required data structure. Although concurrent access greatly increases data bandwidth, it also increases programming complexity, so usually only the core computation of an application is rewritten in SIMD code.
Application-specific integrated circuits (ASICs) are highly efficient at implementing a specific data access behavior. An ASIC implementation can perform data shifting and memory access at the same time, improving both memory access efficiency and application performance. However, a dedicated design for a specific algorithm is not only complex but also limits the applicability of the ASIC.
Existing research on reconfigurable architectures uses several design approaches to meet the need for data stream regularization. Traditional coarse-grained reconfigurable architectures, to allow flexible data storage, let the reconfigurable computing units access off-array data explicitly and schedule memory accesses over a multi-module storage structure to supply computation data. Such a design simplifies data routing, but the units that perform the accesses also occupy computing resources; in particular, a data access by a reconfigurable computing unit can stall the computation pipeline of the whole array and limit computing performance. Although heuristic data prefetching and reuse can effectively hide memory access overhead, the approach is still constrained by data parallelism and cannot exploit data dependences between computations to achieve better data-parallel execution performance.
Summary of the utility model
Purpose: to overcome the deficiencies of the prior art, the utility model provides a coarse-grained dynamically reconfigurable data regularity control unit structure. It explores memory access local registers based on prefetching and reuse, studies the reconfigurable array's demand for discrete, irregular computation data, and proposes a data stream regularization unit built on a vector register file to remove the bottleneck in the reconfigurable memory access data path.
Technical solution: to achieve the above purpose, the utility model adopts the following technical solution:
A coarse-grained dynamically reconfigurable data regularity control unit structure comprises a data flow control module implemented by hardware connections. The data flow control module comprises three modules: a vector loading module, a vector phase-shifting module and an unpacking distribution module. The three modules form a multi-level composite two-stage pipeline through a vector data register file and a vector phase-shifting register file, and the pipeline is synchronized by hardware handshaking. Specifically:
The vector loading module: accesses different data address spaces and loads data from memory into the vector data register file;
The vector phase-shifting module: performs operations such as shifting and splicing on the data in the vector data register file, then writes the output data to the vector phase-shifting register file;
The unpacking distribution module: distributes the data in the vector phase-shifting register file to meet the reconfigurable array's need for concurrent input of computation data.
Preferably, the vector loading module mainly implements memory access control for different storage spaces and unaligned loading of the data stream. It comprises memory access control logic, a concurrent memory access state machine, internal storage access control, external storage access control and data selection:
The memory access control logic: decodes the dynamic reconfiguration configuration information and then sends control commands to the concurrent memory access state machine;
The concurrent memory access state machine: uses the independent data paths of the internal and external memories to process internal and external access requests in parallel, reducing the time spent waiting for data;
The internal storage access control and external storage access control: besides issuing protocol-compliant accesses to the system bus, they split each unaligned access request into multiple address-aligned accesses;
The data selection: data returned from the system bus is shifted and spliced in data selection and then written to the vector data register file.
Preferably, the vector phase-shifting module contains phase-shift processing clusters organized like an extended SIMD structure, which process multiple different registers in parallel and multiply the data processing capacity.
Preferably, each phase-shift processing cluster contains phase-shift processing units. The structure of a phase-shift processing unit is similar to that of a reconfigurable computing unit; the difference is that its processing core is not an arithmetic logic unit (ALU) for data computation but an enhanced phase-shifting unit. Its data width is also much larger than that of a reconfigurable computing unit: a 128-bit design is used. In addition, because a conventional SIMD structure is adopted, data input does not need direct connections between neighboring processing units; only inputs from the different register files need to be considered.
Preferably, the unpacking distribution module loads the data processed by the vector phase-shifting module onto the data interface of the reconfigurable array. It mainly implements two functions:
Data alignment: because the data width of the reconfigurable computing units may not match the computation width of the application, the unpacking distribution module must align the data, which involves shift and padding operations;
Data routing and distribution: the unpacking distribution units in the unpacking distribution module establish a one-to-one correspondence between each vector phase-shifting register in the vector phase-shifting register file and each row of reconfigurable computing units.
It should be understood that all parts involved here, and their functions, are implemented through connections of registers or register files.
Beneficial effects: in the coarse-grained dynamically reconfigurable data regularity control unit structure provided by the utility model, the vector loading module, the vector phase-shifting module and the unpacking distribution module form a multi-level composite two-stage pipeline through double-buffered registers, and the pipeline is synchronized by hardware handshaking. This effectively solves the problems of unaligned access and data regularization during data loading. Compared with a traditional reconfigurable memory access design based on explicit access, the data stream regularization unit improves computing performance by an average factor of 3.34.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of the utility model;
Fig. 2 is a schematic diagram of the structure of the vector loading module;
Fig. 3 is a state transition diagram of the concurrent memory access state machine;
Fig. 4 is a schematic diagram of the structure of the vector phase-shifting module;
Fig. 5 is a schematic diagram of the structure of the phase-shift processing unit;
Fig. 6 is a schematic diagram of the structure of the unpacking distribution module.
Detailed description
The utility model is further described below with reference to the drawings.
A coarse-grained dynamically reconfigurable data regularity control unit structure comprises a data flow control module implemented by hardware connections. As shown in Fig. 1, the data flow control module comprises a vector loading module, a vector phase-shifting module and an unpacking distribution module; the three modules form a multi-level composite two-stage pipeline through double-buffered registers, and the pipeline is synchronized by hardware handshaking.
The vector loading module: under dynamic reconfiguration configuration, accesses different data address spaces and loads data from memory into the vector data register file; its operation is shown in Fig. 2;
The vector phase-shifting module: under dynamically reconfigurable configuration, performs operations such as shifting and splicing on the data in the vector data register file and writes the output data to the vector phase-shifting register file; its working mechanism is shown in Fig. 4;
The unpacking distribution module: under dynamic reconfiguration configuration, distributes the data in the vector phase-shifting register file to meet the reconfigurable array's need for concurrent input of computation data; its workflow is shown in Fig. 6.
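The three-module pipeline described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `DoubleBuffer` class, its valid flags standing in for the hardware handshake, and the `run_pipeline` helper are hypothetical names introduced here, and the patent does not specify the handshake signals at this level of detail.

```python
class DoubleBuffer:
    """Two-bank (ping-pong) register; the valid flags play the role of the handshake."""

    def __init__(self):
        self.banks = [None, None]
        self.valid = [False, False]
        self.wr = 0  # bank the producer writes next
        self.rd = 0  # bank the consumer reads next

    def can_write(self):
        return not self.valid[self.wr]

    def write(self, value):
        assert self.can_write()
        self.banks[self.wr] = value
        self.valid[self.wr] = True
        self.wr ^= 1

    def can_read(self):
        return self.valid[self.rd]

    def read(self):
        assert self.can_read()
        value = self.banks[self.rd]
        self.valid[self.rd] = False
        self.rd ^= 1
        return value


def run_pipeline(words, shift_fn, unpack_fn):
    """Step load -> phase shift -> unpack until every word has drained; returns outputs in order."""
    vdrf = DoubleBuffer()  # stands in for the vector data register file
    vprf = DoubleBuffer()  # stands in for the vector phase-shifting register file
    pending = list(words)
    out = []
    while len(out) < len(words):
        # Evaluate the stages back to front so each sees the previous step's state.
        if vprf.can_read():
            out.append(unpack_fn(vprf.read()))
        if vdrf.can_read() and vprf.can_write():
            vprf.write(shift_fn(vdrf.read()))
        if pending and vdrf.can_write():
            vdrf.write(pending.pop(0))
    return out
```

Each stage only advances when its source buffer holds valid data and its destination buffer has a free bank, which is the behavior the handshake is meant to guarantee; the callables passed as `shift_fn` and `unpack_fn` are placeholders for the real stage logic.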
The vector loading module is connected to the system through two standard AMBA AHB 2.0 buses of different widths, one to the on-chip memory and one to the off-chip memory controller, which satisfies the reconfigurable array's requirements for different kinds of data transfer. The dynamic reconfiguration configuration information is decoded in the memory access control logic, which sends control commands to the concurrent memory access state machine. Depending on the address space, the concurrent memory access state machine directs the internal storage access control or the external storage access control to issue accesses to the system bus. Data returned from the system bus is shifted and spliced in data selection and then written to the vector register file. The internal and external storage access controls each maintain their own bus access protocol control for interacting with the external interface. The operation of the concurrent memory access state machine is analyzed below with reference to Fig. 3.
The concurrent memory access state machine uses the independent data paths of the internal and external memories to process internal and external access requests in parallel, reducing the time spent waiting for data. In Fig. 3, EI denotes a valid internal access request, EE a valid external access request, VI a valid return of internally requested data, and VE a valid return of externally requested data. State transitions follow the request signals, with on-chip requests given higher priority than off-chip requests. When EI is valid, the state machine moves from IDLE to INTERNAL regardless of whether EE is valid. In INTERNAL, if EE is valid, the state machine jumps to BOTH and handles the internal and external accesses at the same time; otherwise, if EE is not valid, it returns to IDLE when the internal access completes and VI becomes valid. Only when EE is valid and EI is not does the state machine go from IDLE to EXTERNAL. In EXTERNAL, if EI becomes valid, the state machine jumps to BOTH. From BOTH it returns to INTERNAL or EXTERNAL according to the order in which the accesses complete. When there are several accesses to the same type of storage space, for example repeated accesses to the on-chip memory, the state machine only switches between IDLE and INTERNAL and reduces to single-order memory access control logic.
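The transitions just described can be written as a small next-state function. This is a sketch, not the circuit of Fig. 3: the function name and keyword arguments are hypothetical, and two transitions that the text does not spell out (EXTERNAL back to IDLE on VE, and BOTH to IDLE when VI and VE arrive together) are assumptions marked in the comments.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    INTERNAL = 1
    EXTERNAL = 2
    BOTH = 3


def next_state(state, ei=False, ee=False, vi=False, ve=False):
    """Return the next state given request (EI, EE) and data-return (VI, VE) signals."""
    if state is State.IDLE:
        if ei:                      # on-chip request has priority, EE is ignored
            return State.INTERNAL
        if ee:                      # only EE valid
            return State.EXTERNAL
        return State.IDLE
    if state is State.INTERNAL:
        if ee:
            return State.BOTH
        if vi:                      # internal access finished
            return State.IDLE
        return State.INTERNAL
    if state is State.EXTERNAL:
        if ei:
            return State.BOTH
        if ve:                      # assumption: symmetric to INTERNAL -> IDLE
            return State.IDLE
        return State.EXTERNAL
    # state is BOTH: leave according to which access completes first
    if vi and ve:                   # assumption: both done in the same cycle
        return State.IDLE
    if vi:                          # internal done, external still outstanding
        return State.EXTERNAL
    if ve:                          # external done, internal still outstanding
        return State.INTERNAL
    return State.BOTH
```

For repeated on-chip accesses with EE never asserted, the function only ever alternates between `State.IDLE` and `State.INTERNAL`, matching the degenerate case mentioned at the end of the paragraph.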
To support unaligned access, both the access requests and the returned data must be processed. Because transfers on the system bus must be address-aligned, the internal and external storage access controls not only issue protocol-compliant accesses to the bus but also split each unaligned access request into multiple address-aligned accesses. The multiple returned data items must then be spliced to obtain the vector data of the unaligned access; this splicing is done in data selection. For example, when the vector loading module starts loading from a start address whose low-order bits are 0x3, the external bus accesses data in 32-bit-aligned Burst4 mode, so the data from the two bus accesses at 0x0 and 0x4 must be shifted and spliced to obtain the required data.
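The 0x3 example can be reproduced in a few lines. The sketch below assumes a byte-addressed memory modeled as a Python `bytes` object and treats each aligned access as a single 32-bit word; burst behavior is not modeled, and the function names are hypothetical.

```python
WORD_BYTES = 4  # 32-bit aligned bus accesses, as in the example above


def aligned_read(memory, addr):
    """Model one address-aligned 32-bit bus access."""
    assert addr % WORD_BYTES == 0, "bus accesses must be address-aligned"
    return memory[addr:addr + WORD_BYTES]


def unaligned_load(memory, addr, nbytes):
    """Return nbytes starting at addr using only aligned accesses, plus the addresses issued."""
    first = addr - addr % WORD_BYTES                    # aligned word holding the first byte
    end = addr + nbytes - 1
    last = end - end % WORD_BYTES                       # aligned word holding the last byte
    addrs = list(range(first, last + WORD_BYTES, WORD_BYTES))
    raw = b"".join(aligned_read(memory, a) for a in addrs)
    offset = addr - first                               # shift amount applied in data selection
    return raw[offset:offset + nbytes], addrs
```

With `memory = bytes(range(16))`, calling `unaligned_load(memory, 0x3, 4)` issues the two aligned accesses at 0x0 and 0x4 and splices out bytes 3 through 6, while an already aligned request such as `unaligned_load(memory, 0x4, 4)` needs only one access.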
As shown in Fig. 4, the vector phase-shifting module contains phase-shift processing clusters organized like an extended SIMD structure, which process multiple different registers in parallel and multiply the data processing capacity.
Preferably, each phase-shift processing cluster contains phase-shift processing units. The structure of a phase-shift processing unit is similar to that of a reconfigurable computing unit; the difference is that its processing core is not an arithmetic logic unit (ALU) for data computation but an enhanced phase-shifting unit. Its data width is also much larger than that of a reconfigurable computing unit: a 128-bit design is used. In addition, because a conventional SIMD structure is adopted, data input does not need direct connections between neighboring processing units; only inputs from the different register files need to be considered.
In this design, the data flow control module contains two independent phase-shift processing clusters, giving a peak capacity of 8 threads processed simultaneously. When only half of that capacity is needed, the second cluster can be switched off, reducing both the loading of phase-shift commands and system power consumption. A single phase-shift processing cluster contains 4 phase-shift processing units, whose functions are optimized specifically for vector phase-shift instructions and whose data width matches that of the double-buffered registers.
As shown in Fig. 4, the instructions in the phase-shift command queue are decoded and then distributed in turn to the 4 phase-shift processing units, and each unit computes according to its decoded command. The vector phase-shifting module can complete data processing in as little as 1 clock cycle; the maximum processing time depends on the length of the command queue and is bounded by the phase-shift requirements and the amount of data.
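How a command queue might be spread over the four units of one cluster can be sketched as follows. The in-order, one-command-per-unit-per-cycle policy is an assumption made for illustration, as are the names `dispatch` and `UNITS_PER_CLUSTER`; the text only states that decoded commands are handed to the units in turn and that the processing time grows with the queue length.

```python
from collections import deque

UNITS_PER_CLUSTER = 4  # one phase-shift processing cluster, per the description


def dispatch(commands):
    """Hand decoded commands to the units in turn.

    Returns one list of (unit_index, command) pairs per modeled cycle.
    Assumption: each unit accepts at most one command per cycle.
    """
    queue = deque(commands)
    schedule = []
    while queue:
        issued = []
        for unit in range(UNITS_PER_CLUSTER):
            if not queue:
                break
            issued.append((unit, queue.popleft()))
        schedule.append(issued)
    return schedule
```

Under this model a queue of up to four commands completes in a single cycle, and longer queues take proportionally more cycles, which is consistent with the minimum of 1 clock cycle and the queue-length dependence stated above.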
As shown in Fig. 5, select signals choose the input data of a phase-shift processing unit from either the vector data register file or the vector phase-shifting register file, and the output data is written to the vector phase-shifting register file. Data from the vector data register file enters the different ports of the phase-shift processing unit through data selection; the computed output is finally written to a vector phase-shifting register and a flag signal is raised. The vector phase-shifting registers also serve as temporary data registers during computation.
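A software model of one such unit might look like the sketch below. The concrete operation shown, extracting a 128-bit window from two concatenated 128-bit operands at a byte offset, is an assumption chosen to illustrate "shift and splice"; the patent does not define the unit's instruction set. The class, method and register-file names are hypothetical, and registers are modeled as Python integers.

```python
WIDTH = 128
MASK = (1 << WIDTH) - 1


def shift_splice(hi, lo, byte_shift):
    """Concatenate hi:lo (256 bits) and return the 128-bit window that starts
    byte_shift bytes above the least significant byte of lo."""
    assert 0 <= byte_shift <= WIDTH // 8
    combined = ((hi & MASK) << WIDTH) | (lo & MASK)
    return (combined >> (8 * byte_shift)) & MASK


class PhaseShiftUnit:
    """One phase-shift processing unit with selectable inputs."""

    def __init__(self, vdrf, vprf):
        self.vdrf = vdrf    # vector data register file (list of 128-bit ints)
        self.vprf = vprf    # vector phase-shifting register file (list of 128-bit ints)
        self.done = False   # flag raised when the result has been written

    def _select(self, src):
        # src is (file_name, index); file_name picks the register file, like a select signal
        file_name, index = src
        return (self.vdrf if file_name == "vdrf" else self.vprf)[index]

    def execute(self, src_hi, src_lo, byte_shift, dst):
        self.done = False
        result = shift_splice(self._select(src_hi), self._select(src_lo), byte_shift)
        self.vprf[dst] = result  # output always goes to the phase-shifting register file
        self.done = True
```

Because the destination is always the phase-shifting register file and sources may also come from it, an intermediate result can be written to one register and consumed by a later command, which mirrors the temporary-register role described in the text.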
The unpacking distribution module maps the processed data in the double-buffered registers, row by row, onto the data interface of each row of the reconfigurable computing array. Following the flow in Fig. 6, it first shifts the input data according to the number of operands and the source operand size given in the unpacking distribution command, then pads the data according to the configured target operand size, and finally distributes the processed data to the data interface of the computing array. Because each reconfigurable processing unit has two data input ports, each row of an 8x8 reconfigurable array has 16 data input ports, so that the at most 16 8-bit data items held in one register can all be aligned and fed into the array at the same time.
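The shift, pad and distribute steps can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: elements are taken from the least significant end of the register first, padding is zero extension unless sign extension is requested, and the function and parameter names are hypothetical.

```python
REG_BITS = 128
PORTS_PER_ROW = 16  # 8 units per row, 2 data input ports each


def unpack_row(reg, count, src_bits, dst_bits, sign_extend=False):
    """Split one 128-bit register into `count` operands for one array row."""
    assert count <= PORTS_PER_ROW and count * src_bits <= REG_BITS
    assert dst_bits >= src_bits
    src_mask = (1 << src_bits) - 1
    dst_mask = (1 << dst_bits) - 1
    ports = []
    for i in range(count):
        field = (reg >> (i * src_bits)) & src_mask      # shift step: isolate operand i
        if sign_extend and field >> (src_bits - 1):
            field |= dst_mask ^ src_mask                # padding step: fill upper bits with ones
        ports.append(field)                             # otherwise upper bits stay zero
    return ports


def distribute(vprf, count, src_bits, dst_bits):
    """Register i of the phase-shifting register file feeds row i of the array."""
    return {row: unpack_row(reg, count, src_bits, dst_bits)
            for row, reg in enumerate(vprf)}
```

With 8-bit source operands, one register supplies exactly the 16 input ports of a row, which is the case worked through at the end of the paragraph above.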
The above is only a preferred embodiment of the utility model. It should be noted that those of ordinary skill in the art can make various improvements and modifications without departing from the principle of the utility model, and such improvements and modifications shall also be regarded as falling within its scope of protection.

Claims (4)

1. A coarse-grained dynamically reconfigurable data regularity control unit structure, characterized in that it comprises a data flow control module; the data flow control module comprises three modules, namely a vector loading module, a vector phase-shifting module and an unpacking distribution module; the three modules form a multi-level composite two-stage pipeline through a vector data register file and a vector phase-shifting register file, and the pipeline is synchronized by hardware handshaking; specifically:
The vector loading module: accesses different data address spaces and loads data from memory into the vector data register file;
The vector phase-shifting module: performs shift and splice operations on the data in the vector data register file, then writes the output data to the vector phase-shifting register file;
The unpacking distribution module: distributes the data in the vector phase-shifting register file.
2. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 1, characterized in that the vector loading module comprises memory access control logic, a concurrent memory access state machine, internal storage access control, external storage access control and data selection:
The memory access control logic: decodes the dynamic reconfiguration configuration information and then sends control commands to the concurrent memory access state machine;
The concurrent memory access state machine: uses the independent data paths of the internal and external memories to process internal and external access requests in parallel;
The internal storage access control and external storage access control: besides issuing protocol-compliant accesses to the system bus, they split each unaligned access request into multiple address-aligned accesses;
The data selection: data returned from the system bus is shifted and spliced in data selection and then written to the vector data register file.
3. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 1, characterized in that the vector phase-shifting module contains phase-shift processing clusters that process multiple different registers in parallel.
4. The coarse-grained dynamically reconfigurable data regularity control unit structure according to claim 3, characterized in that each phase-shift processing cluster contains phase-shift processing units, and the processing core of each phase-shift processing unit is an enhanced phase-shifting unit.
CN201420060846.XU 2014-02-10 2014-02-10 Coarse-granularity dynamic and reconfigurable data regularity control unit structure Expired - Fee Related CN203706197U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201420060846.XU CN203706197U (en) 2014-02-10 2014-02-10 Coarse-granularity dynamic and reconfigurable data regularity control unit structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201420060846.XU CN203706197U (en) 2014-02-10 2014-02-10 Coarse-granularity dynamic and reconfigurable data regularity control unit structure

Publications (1)

Publication Number Publication Date
CN203706197U true CN203706197U (en) 2014-07-09

Family

ID=51056602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201420060846.XU Expired - Fee Related CN203706197U (en) 2014-02-10 2014-02-10 Coarse-granularity dynamic and reconfigurable data regularity control unit structure

Country Status (1)

Country Link
CN (1) CN203706197U (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020103058A1 (en) * 2018-11-21 2020-05-28 吴国盛 Programmable operation and control chip, a design method, and device comprising same
CN112631610A (en) * 2020-11-30 2021-04-09 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020103058A1 (en) * 2018-11-21 2020-05-28 吴国盛 Programmable operation and control chip, a design method, and device comprising same
CN111433758A (en) * 2018-11-21 2020-07-17 吴国盛 Programmable operation and control chip, design method and device thereof
CN111433758B (en) * 2018-11-21 2024-04-02 吴国盛 Programmable operation and control chip, design method and device thereof
CN112631610A (en) * 2020-11-30 2021-04-09 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure
CN112631610B (en) * 2020-11-30 2022-04-26 上海交通大学 Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure

Similar Documents

Publication Publication Date Title
CN103761075A (en) Coarse granularity dynamic reconfigurable data integration and control unit structure
US10515046B2 (en) Processors, methods, and systems with a configurable spatial accelerator
US7694084B2 (en) Ultra low power ASIP architecture
CN101055644B (en) Mapping processing device and its method for processing signaling, data and logic unit operation method
EP3776229A1 (en) Apparatuses, methods, and systems for remote memory access in a configurable spatial accelerator
CN102750133B (en) 32-Bit triple-emission digital signal processor supporting SIMD
CA2788263A1 (en) A tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
CN101739235A (en) Processor unit for seamless connection between 32-bit DSP and universal RISC CPU
CN102279818A (en) Vector data access and storage control method supporting limited sharing and vector memory
WO2023092620A1 (en) Risc-v-based three-dimensional interconnection many-core processor architecture and operating method therefor
CN103984530A (en) Assembly line structure and method for improving execution efficiency of store command
US10659396B2 (en) Joining data within a reconfigurable fabric
US20230359584A1 (en) Compiler operations for tensor streaming processor
CN203706197U (en) Coarse-granularity dynamic and reconfigurable data regularity control unit structure
CN101739383B (en) Configurable processor architecture and control method thereof
US8555097B2 (en) Reconfigurable processor with pointers to configuration information and entry in NOP register at respective cycle to deactivate configuration memory for reduced power consumption
US7461235B2 (en) Energy-efficient parallel data path architecture for selectively powering processing units and register files based on instruction type
Stepchenkov et al. Recurrent data-flow architecture: features and realization problems
EP1623318A2 (en) Processing system with instruction- and thread-level parallelism
CN103455367A (en) Management unit and management method for realizing multi-task scheduling in reconfigurable system
Abdelhamid et al. MITRACA: A next-gen heterogeneous architecture
CN101246435A (en) Processor instruction set supporting part statement function of higher order language
WO2022139647A1 (en) A method and system for rearranging and distributing data of an incoming image for processing by multiple processing clusters
US10534608B2 (en) Local computation logic embedded in a register file to accelerate programs
CN1331043C (en) Method for supporting MMX command in mobile microprocessor and extended microprocessor

Legal Events

Date Code Title Description
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140709

Termination date: 20200210
