CN103984672B - Processor cluster structure based on shared register file and global synchronization module - Google Patents


Info

Publication number
CN103984672B
Authority
CN
China
Prior art keywords
register file
signal
synchronization module
shared
bit
Legal status
Active
Application number
CN201410197295.6A
Other languages
Chinese (zh)
Other versions
CN103984672A (en)
Inventor
韩军
窦仁峰
曾凌云
曾晓洋
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
2014-05-12
Filing date
2014-05-12
Publication date
2017-01-11
Application filed by Fudan University (2014-05-12)
Priority to CN201410197295.6A (2014-05-12)
Publication of CN103984672A (2014-08-13)
Application granted
Publication of CN103984672B (2017-01-11)
Legal status: Active


Abstract

The invention belongs to the field of high-performance application-specific processor architecture design, and relates in particular to a processor cluster structure based on a shared register file and a global synchronization module. The invention proposes the concept of a shared register file together with a method of implementing it, which addresses efficient data sharing and fast data communication between processors in application domains that operate on wide data words, such as public-key cryptography. Because some application algorithms share parameters, the data to be shared can be stored once in the shared register file, which reduces register overhead. The invention also provides a global synchronization module circuit structure that achieves efficient synchronization between processors and, together with the shared register file, enables ordered communication. The resulting processor cluster structure provides fast communication, efficient sharing, and accurate synchronization between clusters.

Description

Processor cluster architecture based on a shared register file and a global synchronization module
Technical field
The invention belongs to the field of high-performance application-specific processor architecture design, and relates in particular to a processor cluster architecture based on a shared register file and a global synchronization module.
Background art
In recent years, developments in certain application domains have created demand for high-performance computing tailored to those domains. Conventional general-purpose processor architectures are not optimized for any particular domain, so they struggle to deliver such performance. In public-key cryptography, for example, modular arithmetic on wide (long-bit-width) operands is routinely required, and conventional processors execute it so inefficiently that they often cannot meet application requirements. Designing processor architectures optimized for a specific domain has therefore become a popular way to address these problems.
For the algorithms and applications of some domains, high-speed communication among multiple processors is key to achieving high performance. Conventional communication schemes based on message passing or shared memory, however, have relatively long latency and cannot provide the high-speed, low-latency, real-time communication needed between cores. Solving this problem is a challenge.
Algorithms and applications in some domains also exhibit parameter sharing, such as the parameters used for modular multiplication in public-key cryptography. Copying these wide parameters into every processor unit in order to share them inevitably costs several copies of a large storage space. Storing shared parameters with low overhead is therefore key to an efficient system design.
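The storage cost described above is simple arithmetic. A minimal Python sketch follows; the cluster size of four (as in Fig. 1) and the two 2048-bit parameters are hypothetical values chosen for illustration, not figures taken from the patent.

```python
from typing import List

def param_storage_bits(num_procs: int, param_widths: List[int], shared: bool) -> int:
    """Bits needed to hold a set of common parameters across a cluster.

    shared=False: every processor keeps its own copy (replicated).
    shared=True:  a single copy is kept where all processors can reach it.
    """
    one_copy = sum(param_widths)
    return one_copy if shared else num_procs * one_copy

# Hypothetical example: 4 processors, two 2048-bit parameters.
widths = [2048, 2048]
replicated = param_storage_bits(4, widths, shared=False)
single = param_storage_bits(4, widths, shared=True)
print(replicated, single, replicated - single)  # 16384 4096 12288
```

The saving grows linearly with the number of processors, since the replicated layout pays for one full copy per processor.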
Conventional multicore architectures based on shared memory usually handle the synchronization of shared accesses in software. Software synchronization takes many clock cycles and can hardly achieve cycle-accurate synchronization between cores. Efficient synchronization is therefore another key technology for high-performance computing.
Summary of the invention
To overcome these challenges, the present invention proposes a processor cluster architecture based on a shared register file and a global synchronization module, which provides fast communication, efficient sharing, and accurate synchronization between clusters.
The processor cluster architecture provided by the invention consists of several processor units, one shared register file, and one global synchronization module, wherein: each processor unit contains a private register file; the shared register file and the private register files each have a number of read ports and write ports, through which the processor units are connected to the shared register file and to their private register files; the global synchronization module consists mainly of internal sync-bit registers and a combinational logic circuit that determines whether synchronization has succeeded, and the processor units are connected to the global synchronization module through synchronization input ports provided on the module; and:
In the decode stage, a processor unit issues register file read signals and a synchronization signal for accessing the global synchronization module. The low-order bits of the register file read signal are replicated and sent simultaneously to the read ports of the shared register file and of the private register file, which are read in parallel; the high-order bits of the register file read signal control a selector that chooses between the read output of the shared register file and the read output of the private register file. The global synchronization module determines whether synchronization has succeeded from the synchronization signals at its synchronization input ports and its synchronization status registers, and sends the synchronization result to each processor unit;
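The decode-stage read path can be modeled in a few lines of Python. This is a behavioral sketch under assumed parameters, not the patented circuit: the 6-bit register specifier, the 32-entry files, the use of the single top bit as the selector, and the convention that a top bit of 1 selects the shared file are all hypothetical.

```python
from typing import List

# Hypothetical parameters: a 6-bit register specifier whose top bit selects the file.
ADDR_BITS = 6
LOW_BITS = ADDR_BITS - 1          # low-order part, replicated to both files
LOW_MASK = (1 << LOW_BITS) - 1    # 0b011111

def read_operand(addr: int, shared_rf: List[int], private_rf: List[int]) -> int:
    """Decode-stage read of one operand, following the described datapath."""
    low = addr & LOW_MASK          # the same low bits drive both read ports
    high = addr >> LOW_BITS        # the high bit is kept for the output selector
    shared_out = shared_rf[low]    # both files are read in parallel ...
    private_out = private_rf[low]
    return shared_out if high == 1 else private_out  # ... and the selector picks one

shared = [1000 + i for i in range(32)]
private = list(range(32))
print(read_operand(3, shared, private))       # 3    (high bit 0 -> private file)
print(read_operand(32 + 3, shared, private))  # 1003 (high bit 1 -> shared file)
```

Reading both files unconditionally and selecting afterwards keeps the selection off the address path: the high bit only has to settle before the output selector, not before the read starts.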
In the write-back stage, the processor unit issues a register file write-data signal, a register file write-address signal, and a register file write-enable signal. The write-data signal is replicated directly to the write-data ports of both the shared register file and the private register file; the low-order bits of the write-address signal are replicated and sent simultaneously to the write ports of both files; and the write-enable signal is steered by the high-order bits of the write address, so that the write is performed either on the shared register file or on the private register file.
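A matching Python sketch of the write-back path follows, again assuming a hypothetical 6-bit specifier whose top bit (1 = shared file) selects the target. The write data and the low address bits reach both files unchanged, and only the write enable is steered by the high address bit.

```python
from typing import List

# Hypothetical addressing: 6-bit specifier, top bit selects the file.
ADDR_BITS = 6
LOW_BITS = ADDR_BITS - 1
LOW_MASK = (1 << LOW_BITS) - 1

def write_back(addr: int, data: int, we: int,
               shared_rf: List[int], private_rf: List[int]) -> None:
    """Write-back stage: data and low address bits are broadcast, the enable is steered."""
    low = addr & LOW_MASK          # replicated to both write ports
    high = addr >> LOW_BITS        # decides which file receives the enable
    we_shared = we & high
    we_private = we & (high ^ 1)
    if we_shared:                  # only the enabled file latches the data
        shared_rf[low] = data
    if we_private:
        private_rf[low] = data

shared, private = [0] * 32, [0] * 32
write_back(32 + 5, 0xABCD, 1, shared, private)  # high bit 1 -> shared file
write_back(5, 7, 1, shared, private)            # high bit 0 -> private file
write_back(5, 9, 0, shared, private)            # enable low -> no write
print(hex(shared[5]), private[5])  # 0xabcd 7
```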
In the invention, the sync-bit registers are organized into several groups, each consisting of several one-bit status registers. The input of a one-bit status register is the AND of a group's synchronization enable signal with one bit of that group's sync bits. The combinational logic that determines whether synchronization has occurred computes, from a group's sync bits and the values of all status registers, a signal indicating whether synchronization has succeeded; this result signal is output to the corresponding processor unit.
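The text does not spell out the exact success condition, so the Python sketch below adopts one plausible reading as an explicit assumption. Each processor p drives one group, consisting of an enable and an N-bit sync-bit vector naming the processors it wants to synchronize with (itself included). Status register (g, k) takes enable_g AND bit k of group g's sync bits, and processor p succeeds when every processor named in its sync bits holds a status row equal to p's own sync bits. Because all participants compare the same rows against the same vector, they see success in the same cycle. The function names are hypothetical, and the example treats the output of `status_inputs` as the register contents after the next clock edge.

```python
from typing import List

Row = List[int]

def status_inputs(enables: List[int], sync_bits: List[Row]) -> List[Row]:
    """D input of status register (g, k): enable of group g AND bit k of group g's sync bits."""
    return [[en & b for b in bits] for en, bits in zip(enables, sync_bits)]

def sync_success(p: int, sync_bits: List[Row], status: List[Row]) -> int:
    """Hypothetical rule: p succeeds when every processor named in its sync bits
    holds a status row equal to p's sync bits."""
    mask = sync_bits[p]
    return int(all(status[k] == mask for k, m in enumerate(mask) if m))

# Processors 0 and 2 synchronize with each other; 1 and 3 take no part.
m02, idle = [1, 0, 1, 0], [0, 0, 0, 0]
bits = [m02, idle, m02, idle]

# Registers after an edge where only processor 0 asserted its enable.
status = status_inputs([1, 0, 0, 0], bits)
print(sync_success(0, bits, status))  # 0: processor 0 keeps waiting

# Registers after processor 2 also arrives (processor 0 still asserts its enable).
status = status_inputs([1, 0, 1, 0], bits)
print(sync_success(0, bits, status), sync_success(2, bits, status))  # 1 1
```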
The beneficial effect of the invention is that the proposed processor cluster architecture provides fast communication, efficient sharing, and accurate synchronization between clusters.
Brief description of the drawings
Fig. 1 shows a processor cluster architecture with four processors.
Fig. 2 shows the connection of the register file in the decode stage of a conventional RISC processor.
Fig. 3 shows the decode-stage connections of a processor with a shared register file and a private register file.
Fig. 4 shows the connection between the processors and the synchronization module.
Fig. 5 shows the circuit structure of the synchronization module.
Detailed description of the invention
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The invention proposes a processor cluster architecture based on a shared register file and a global synchronization module, consisting of several processor units, one shared register file, and one global synchronization module. Fig. 1 shows a cluster of four processors. Fig. 2 shows how a conventional RISC processor is connected to its register file, and Fig. 3 shows how, in the invention, a processor is connected to the shared register file and to its private register file. Each processor unit issues the read signals for accessing the shared register file in the decode stage, and these are connected to read ports of the shared register file; each processor unit issues the write signals for accessing the shared register file in the write-back stage, and these are connected to write ports of the shared register file. The shared register file has several read ports, to which the read signals of the several processor units are connected, and several write ports, to which their write signals are connected. Fig. 4 shows how, in the invention, the processors are connected to the global synchronization module: each processor unit issues the synchronization signals for accessing the global synchronization module in the decode stage, and these are connected to the module's synchronization input ports. The global synchronization module determines whether synchronization has succeeded from the signals at its synchronization input ports and its internal synchronization status registers, and sends the synchronization result to each processor unit.
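The multi-ported shared register file can be sketched as a small cycle-level Python model. The class name, the layout with one read port and one write port per processor (kept to one read port for brevity), the read-before-write ordering within a cycle, and the choice to reject two writes to the same register in one cycle are assumptions made for illustration; the patent does not specify them.

```python
from typing import List, Optional, Tuple

class SharedRegisterFile:
    """Hypothetical shared register file with one read and one write port per processor."""

    def __init__(self, num_procs: int, num_regs: int) -> None:
        self.num_procs = num_procs
        self.regs = [0] * num_regs

    def clock(self, reads: List[Optional[int]],
              writes: List[Optional[Tuple[int, int]]]) -> List[Optional[int]]:
        """One cycle. reads[p]: processor p's decode-stage read address (or None).
        writes[p]: processor p's write-back (address, data) pair (or None)."""
        if len(reads) != self.num_procs or len(writes) != self.num_procs:
            raise ValueError("need exactly one read and one write slot per processor")
        # Reads return the contents from before this cycle's writes (assumption).
        values = [None if a is None else self.regs[a] for a in reads]
        targets = [w[0] for w in writes if w is not None]
        if len(targets) != len(set(targets)):
            # The patent does not say how simultaneous writes to one register resolve.
            raise ValueError("write conflict on a shared register")
        for w in writes:
            if w is not None:
                addr, data = w
                self.regs[addr] = data
        return values

srf = SharedRegisterFile(num_procs=4, num_regs=32)
# Cycle 1: processor 0 writes r5 while processor 1 reads it (sees the old value).
print(srf.clock([None, 5, None, None], [(5, 42), None, None, None]))  # [None, 0, None, None]
# Cycle 2: processor 1 reads r5 again and now sees processor 0's value.
print(srf.clock([None, 5, None, None], [None, None, None, None]))     # [None, 42, None, None]
```

The second cycle illustrates the communication pattern the architecture targets: a value written back by one processor becomes visible to another processor's decode stage without passing through memory.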
The shared register file is a register file that several processor units can access simultaneously; in addition, each processor unit contains a private register file that only that processor unit can access. The shared register file is structurally similar to a private register file, except that it has more read and write ports, their number matching the number of processors. In terms of connections, the shared and private register files together act like sub-register units of a conventional register file. The connections between a processor and its private and shared register files are shown in Fig. 3: the low-order bits of the register read signal that the processor unit issues in the decode stage are replicated and sent simultaneously to the read ports of the shared and private register files, which are read in parallel, and the high-order bits of the same read signal control a selector that chooses between the read output of the shared register file and that of the private register file. The register file write-data signal that the processor unit issues in the write-back stage is replicated directly to the write-data ports of both files; the low-order bits of the write-address signal issued in the write-back stage are replicated and sent simultaneously to the write ports of both files; and the write-enable signal issued in the write-back stage is steered by the high-order bits of the write address, so that the write goes either to the shared register file or to the private register file.
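The port-count relationship stated above can be written down directly. In this Python sketch, the assumption that each processor needs two read ports and one write port (a typical RISC operand pattern) is hypothetical; the patent states only that the shared file has more ports than a private file and that its port count must match the number of processors. The four-processor case corresponds to Fig. 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ports:
    read: int
    write: int

# Hypothetical per-processor demand: two source operands read, one result written.
READS_PER_PROC = 2
WRITES_PER_PROC = 1

def private_rf_ports() -> Ports:
    # A private register file serves only its own processor.
    return Ports(READS_PER_PROC, WRITES_PER_PROC)

def shared_rf_ports(num_procs: int) -> Ports:
    # The shared file needs a full set of ports for every processor in the cluster.
    return Ports(num_procs * READS_PER_PROC, num_procs * WRITES_PER_PROC)

print(private_rf_ports())  # Ports(read=2, write=1)
print(shared_rf_ports(4))  # Ports(read=8, write=4)
```

The shared file's port count grows linearly with the cluster size, which the sketch makes explicit; the patent does not quantify the resulting area cost.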
The global synchronization module consists of internal sync-bit registers and a combinational logic circuit that determines whether synchronization has succeeded. The sync-bit registers are organized into several groups, each consisting of several one-bit status registers. Fig. 5 shows the circuit structure of a shared synchronization module for four processor units. The processor units generate several groups of synchronization enable signals and sync-bit signals in the decode stage. The input of each one-bit status register is the AND of a group's synchronization enable signal with one bit of that group's sync bits. The combinational logic computes, from a group's sync bits and the values of all status registers, a result signal indicating whether synchronization has succeeded. This logic is built mainly from XOR, NAND, and AND gates: the outputs of all status registers are XORed with the sync-bit signals in a fixed order, the XOR outputs are then combined by NAND gates in a fixed order, and finally all NAND outputs are ANDed together and reduced to a one-bit synchronization result. The result signal can be output to the decode stage of the corresponding processor unit.
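The XOR, NAND, and AND stages can be illustrated with a gate-level Python sketch. It implements a hypothetical success rule, stated in the comments: processor p succeeds when every processor named in its sync bits holds a status row equal to p's sync bits. The gate arrangement is one possible mapping onto the three gate types the text names, not a reconstruction of Fig. 5, and the block checks it exhaustively against a behavioral statement of the same rule for a three-processor cluster.

```python
from itertools import product
from typing import List

Row = List[int]

def nand(*bits: int) -> int:
    """Multi-input NAND gate; a one-input NAND acts as an inverter."""
    return 0 if all(bits) else 1

def sync_gates(p: int, sync_bits: List[Row], status: List[Row]) -> int:
    """Gate-level success signal for processor p (hypothetical mapping)."""
    mask = sync_bits[p]
    n = len(mask)
    row_ok = []
    for k in range(n):
        # XOR stage: 1 wherever status row k differs from p's sync bits.
        diff = [status[k][j] ^ mask[j] for j in range(n)]
        # NAND stage: OR of the differences via De Morgan (inverters, then a wide NAND).
        mismatch = nand(*[nand(d) for d in diff])
        # NAND stage: 0 only when processor k participates and its row mismatches.
        row_ok.append(nand(mask[k], mismatch))
    # AND stage: reduce all row checks to a one-bit synchronization result.
    return int(all(row_ok))

def sync_reference(p: int, sync_bits: List[Row], status: List[Row]) -> int:
    """Behavioral statement of the same hypothetical rule."""
    mask = sync_bits[p]
    return int(all(status[k] == mask for k, m in enumerate(mask) if m))

def count_disagreements(n: int = 3) -> int:
    """Compare both versions for processor 0 over every mask and every status state."""
    rows = [list(bits) for bits in product([0, 1], repeat=n)]
    idle = [0] * n
    errors = 0
    for mask in rows:
        sync_bits = [mask] + [idle] * (n - 1)
        for state in product(rows, repeat=n):
            status = list(state)
            if sync_gates(0, sync_bits, status) != sync_reference(0, sync_bits, status):
                errors += 1
    return errors

print(count_disagreements())  # 0
```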
This processor architecture has been verified in the design of a public-key cryptography processor platform. Communication through the shared registers and the global synchronization module achieves a latency comparable to the number of processor pipeline stages (the lower bound on communication latency). Because the shared registers exist, identical parameters need not be duplicated on multiple cores (parameters in cryptography are wide, so their storage overhead is large), which yields a smaller cluster area. Some operations also require cycle-accurate synchronization among multiple cores, which is exactly the hardware synchronization the global synchronization module provides. The processor cluster architecture proposed by the invention thus provides fast communication, efficient sharing, and accurate synchronization between clusters, addressing the design challenges of certain application domains.

Claims (1)

1. A processor cluster architecture based on a shared register file and a global synchronization module, characterized in that: the processor cluster architecture consists of several processor units, one shared register file, and one global synchronization module, wherein: each processor unit contains a private register file; the shared register file and the private register files each have a number of read ports and write ports, through which the processor units are connected to the shared register file and to the private register files; the global synchronization module consists mainly of internal sync-bit registers and a combinational logic circuit for determining whether synchronization has succeeded, and the processor units are connected to the global synchronization module through synchronization input ports provided on the global synchronization module; and wherein:
In the decode stage, a processor unit issues a register file read signal and a synchronization signal for accessing the global synchronization module; the low-order bits of the register file read signal are replicated and sent simultaneously to the read ports of the shared register file and of the private register file, which are read in parallel; the high-order bits of the register file read signal control a selector that chooses between the read output of the shared register file and the read output of the private register file; the global synchronization module determines whether synchronization has succeeded from the synchronization signals at its synchronization input ports and its synchronization status registers, and sends the synchronization result to each processor unit;
In the write-back stage, the processor unit issues a register file write-data signal, a register file write-address signal, and a register file write-enable signal; the write-data signal is replicated directly to the write-data ports of the shared register file and the private register file; the low-order bits of the write-address signal are replicated and sent simultaneously to the write ports of the shared register file and the private register file; and the write-enable signal is steered by the high-order bits of the write address, so that the write is performed either on the shared register file or on the private register file;
The sync-bit registers consist of several groups of registers, each group consisting of several one-bit status registers; the input of a one-bit status register is the AND of a group's synchronization enable signal with one bit of that group's sync bits; the combinational logic circuit for determining whether synchronization has occurred computes, from a group's sync bits and the values of all status registers, a result signal indicating whether synchronization has succeeded; and the synchronization result signal is output to the corresponding processor unit.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410197295.6A CN103984672B (en) 2014-05-12 2014-05-12 Processor cluster structure based on shared register file and global synchronization module

Publications (2)

Publication Number Publication Date
CN103984672A 2014-08-13
CN103984672B 2017-01-11

Family

ID=51276650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410197295.6A Active CN103984672B (en) 2014-05-12 2014-05-12 Processor cluster structure based on shared register file and global synchronization module

Country Status (1)

Country Link
CN (1) CN103984672B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101833439A (en) * 2010-04-20 2010-09-15 清华大学 Parallel computing hardware structure based on separation and combination thought
CN102141974A (en) * 2011-04-11 2011-08-03 复旦大学 Internuclear communication method of multinuclear processor and circuit structure thereof

Non-Patent Citations (1)

Title
A 65nm 39GOPS/W 24-Core Processor with 11Tb/s/W Packet-Controlled Circuit-Switched Double-Layer Network-on-Chip and Heterogeneous Execution Array; Peng Ou et al; 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers; 2013-02-21; p. 56, col. 1, para. 1, lines 4-12, para. 2, and para. 3, lines 18-21; col. 2, para. 2, lines 4-10 and 14-21; Figs. 3.6.1-3.6.6 *

Also Published As

Publication number Publication date
CN103984672A (en) 2014-08-13

Similar Documents

Publication Publication Date Title
CN103744644B (en) The four core processor systems built using four nuclear structures and method for interchanging data
CN101329589B (en) Control system and method of low power consumption read-write register
CN108536642A (en) Big data operation acceleration system and chip
TW200949591A (en) Synchronous to asynchronous logic conversion
CN101236774B (en) Device and method for single-port memory to realize the multi-port storage function
Fu et al. A study on the optimization of blockchain hashing algorithm based on PRCA
CN101441616B (en) Rapid data exchange structure based on register document and management method thereof
CN104699641A (en) EDMA (enhanced direct memory access) controller concurrent control method in multinuclear DSP (digital signal processor) system
US8443315B2 (en) Reset mechanism conversion
CN103761072A (en) Coarse granularity reconfigurable hierarchical array register file structure
Nguyen et al. An efficient FPGA-based database processor for fast database analytics
CN103577161A (en) Big data frequency parallel-processing method
WO2020087276A1 (en) Big data operation acceleration system and chip
CN104820659A (en) Multi-mode dynamic configurable high-speed memory access interface for coarse grain reconfigurable system
CN103984672B (en) Processor cluster structure based on shared register file and global synchronization module
CN105825880B (en) Access control method, device and circuit for DDR controller
CN104035896A (en) Off-chip accelerator applicable to fusion memory of 2.5D (2.5 dimensional) multi-core system
Contini et al. Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
CN101980140B (en) SSRAM access control system
CN208298179U (en) Big data operation acceleration system and chip
CN203706196U (en) Coarse-granularity reconfigurable and layered array register file structure
US20170212861A1 (en) Clock tree implementation method, system-on-chip and computer storage medium
WO2020087275A1 (en) Method for big data operation acceleration system carrying out operations
CN106709187B (en) Method and device for establishing CPU based on model
CN204009891U (en) The soft core of a kind of sixteen bit embedded chip

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant