CN106201939B - Multi-core directory coherence device for a GPDSP architecture - Google Patents
Classifications
- G06F13/30: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer (e.g. direct memory access DMA, cycle steal) with priority control
- G06F2213/2806: Space or buffer allocation for DMA transfers
Abstract
A multi-core directory coherence device for a GPDSP architecture, comprising: kernels, each containing a DMA engine and an L1D, the L1D being the level-one data Cache; the DMA is used to complete data transfers between peripherals and between cores; the L1D contains two parallel processing units, Normal Deal and Monitor Deal, where the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal processing unit; an on-chip last-level Cache, distributed across the on-chip interconnection network; off-chip DDR storage, whose data is cached in the L1D and the on-chip last-level Cache; and an on-chip interconnection network, which first receives a network request, then decodes it, and after decoding the destination node and destination device sends the request to the corresponding location. The invention has the advantages of simple principle, convenient operation, high flexibility, and wide applicability.
Description
Technical field
The present invention relates generally to the field of processor microarchitecture design, and in particular to the multi-core storage-access design of a General-Purpose Digital Signal Processor (GPDSP).
Background art
As one of the three pillars of the processor field (alongside the CPU and the GPU), the DSP is widely used in embedded systems for its high performance-per-watt, good programmability, and low power consumption. Unlike a CPU, a DSP has the following characteristics: 1) strong computing capability, emphasizing efficient computation over control and transaction processing; 2) dedicated hardware support for typical signal-processing operations, such as multiply-accumulate and circular addressing; 3) the common traits of embedded microprocessors, such as address and instruction paths no wider than 32 bits, imprecise interrupts, and a programming model of short-term offline debugging followed by long-term online resident operation (rather than the debug-then-run model of a general-purpose CPU); 4) integrated peripheral interfaces centered on fast peripherals, which is especially beneficial for online transmission and reception of AD/DA signals and also supports high-speed direct links between DSPs.
Given the DSP's high performance-per-watt and powerful computing capability, how to upgrade and improve the traditional DSP architecture to make it suitable for high-performance computing is a current research hotspot both at home and abroad.
For example, the Chinese patent application "General-purpose computing digital signal processor" (application number 201310725118.6) proposes a multi-core microprocessor, the GPDSP, which retains the essential characteristics of the DSP and its high-performance low-power advantage while efficiently supporting general scientific computing. The architecture has the following features: 1) direct support for double-precision floating-point and 64-bit fixed-point data, with general registers, data buses, and instruction widths of 64 bits or more, and an address-bus width of 40 bits or more; 2) tight coupling of heterogeneous CPU and DSP cores, where the CPU core supports a complete operating system and the scalar unit of each DSP core supports an operating-system micro-kernel; 3) a unified programming model across the CPU core, the DSP cores, and the vector array structure within each DSP core; 4) retention of cross-machine manual debugging together with a local CPU-host debugging mode; 5) retention of the essential characteristics of a common DSP apart from the data width.
As another example, the Chinese patent application "Multi-level cooperative and shared storage device and access method for a GPDSP" (application number 20150135194.0) addresses the application demands of the GPDSP and proposes a multi-level cooperative storage architecture. Each DSP core contains a globally addressable, programmer-visible local large-capacity scalar memory and vector array memory, and multiple DSP cores share a large-capacity on-chip global Cache through the network-on-chip; the direct memory access controller (DMA) and the network-on-chip provide high-bandwidth data paths among cores, core-internal memories, and the global Cache, realizing cooperative and shared data access within a single core and among multiple cores.
In the GPDSP, the in-core on-chip memories and the background DMA transfer mechanism are retained, which makes data coherence difficult to implement. The Chinese patent application "Explicit active management method for multi-core Cache coherence oriented to streaming applications" (application number 201310166383.5) adopts a software-programmable data-coherence management method, handing the work of maintaining data coherence to the programmer, who must actively manage and maintain the correctness of the produce-consume process of the data. If a hardware-managed coherence protocol were used in the GPDSP instead, the programming burden on the programmer would be reduced, and the versatility of the GPDSP thereby improved.
Summary of the invention
In view of the problems of the prior art, the technical problem to be solved by the present invention is to provide a multi-core directory coherence device for a GPDSP architecture that is simple in principle, convenient to operate, highly flexible, and widely applicable.
To solve the above technical problem, the invention adopts the following technical scheme:
A multi-core directory coherence device for a GPDSP architecture, comprising:
a kernel, containing a DMA engine and an L1D, the L1D being the level-one data Cache; the DMA is used to complete data transfers between peripherals and between cores; the L1D contains two parallel processing units, Normal Deal and Monitor Deal: the Normal Deal processing unit completes the processing of load and store instructions, while the Monitor Deal processing unit responds to snoop requests arriving at any time, and its processing is not affected by the Normal Deal processing unit;
an on-chip last-level Cache, distributed across the on-chip interconnection network;
off-chip DDR storage, whose data may be cached in the L1D and the on-chip last-level Cache;
an on-chip interconnection network, which first receives a network request, then decodes it, and after decoding the destination node and destination device sends the request to the corresponding location.
As a further improvement of the present invention: the on-chip last-level Cache is divided into several banks, each bank consisting of an input buffer unit (IBUF), a pipeline unit (PipeLine), an output buffer unit (OBUF), and a return-ring network processing logic unit (Rtn NAC); the input buffer unit is responsible for buffering requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit is responsible for buffering requests from the last-level Cache to the DDR; the return-ring network processing logic unit is responsible for arbitrating among the multiple types of requests entering the network-on-chip.
As a further improvement of the present invention: the device further includes an MSI directory protocol unit, which maintains coherence for requests issued by the L1D; the MSI directory protocol unit comprises three directory states, M, S, and I: the M state indicates that the data is exclusively owned by some DSP core and is dirty; the S state indicates that the data is shared by one or more DSP cores and is clean; the I state indicates that no DSP core holds a copy of the data.
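The three stable directory states described above can be captured in a small Python sketch (a minimal illustration; the enum and helper names are invented here, not taken from the patent):

```python
from enum import Enum

class DirState(Enum):
    M = "M"  # exclusively owned by one DSP core; that core's copy is dirty
    S = "S"  # shared clean by one or more DSP cores
    I = "I"  # no DSP core holds a copy

def llc_holds_latest(state: DirState) -> bool:
    # Only in M does the latest data live in some core's L1D instead of the LLC.
    return state is not DirState.M
```

The helper expresses the rule used repeatedly below: in states I and S the last-level Cache can answer requests directly, while state M forces a forward to the owning core.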
As a further improvement of the present invention: the device further includes directory controllers, which judge the correctness of the scheme at the protocol level and are used to complete the processing of a single request under different directory states, the handling of conflicts among multiple related requests, and the responses of data blocks in "transient" directory states to related requests; the directory controllers fall into two classes, one placed in the L1D and the other in the on-chip last-level Cache, and directory operations are carried out in both the L1D and the on-chip last-level Cache.
As a further improvement of the present invention: the on-chip last-level Cache stores a full-directory structure, which allocates a directory entry for every data block cached in the on-chip last-level Cache; the directory entry comprises two parts, a directory state and a sharer list, where the sharer list allocates one bit per DSP core to indicate whether the data has a copy in the corresponding DSP core.
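The directory-entry layout (state plus one sharer bit per DSP core) can be sketched as follows; the class and method names are hypothetical, chosen only to illustrate the structure just described:

```python
class DirectoryEntry:
    """One entry per cached block: directory state plus a one-bit-per-core sharer list."""
    def __init__(self, num_cores: int):
        self.state = "I"                     # stable states: M, S, I
        self.sharers = [False] * num_cores   # bit i set: DSP core i holds a copy

    def add_sharer(self, core: int) -> None:
        self.sharers[core] = True

    def remove_sharer(self, core: int) -> None:
        self.sharers[core] = False
        if not any(self.sharers):            # last copy gone: directory goes to I
            self.state = "I"
```

A usage example: with four cores, marking core 2 as a sharer sets exactly one bit, and removing the last sharer drops the entry back to I.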
As a further improvement of the present invention: the L1D uses a pipeline structure to complete the pipelined operations of instruction decoding, address calculation, tag and state-bit reading, hit judgment, data-body access, and data return.
As a further improvement of the present invention: the pipeline stages of the L1D are DC1, DC2, EX1, EX2, EX3, EX4, and EX5; after the L1D receives a load or store instruction, it first performs two stages of decode processing, DC1 and DC2, which determine the operation type of the instruction and the function to be realized; after instruction decoding, address calculation is carried out in EX1, completing the functional setup; the L1D Cache then executes a three-beat read pipeline and a two-beat write pipeline: the read pipeline occupies positions EX2, EX3, and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss processing, and access output respectively, while the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss processing respectively.
As a further improvement of the present invention: the on-chip last-level Cache uses a pipeline structure to realize the pipelined operations of tag and state-bit reading, directory-entry reading, hit judgment, snoop processing, data-body access, and data return.
As a further improvement of the present invention: the pipeline structure of the on-chip last-level Cache includes:
the first pipeline stage, Req_Arb: performs round-robin arbitration between requests and Flush operations and sends the winning request to the next stage; at the same time, it reads the valid bit, dirty bit, and Tag information of the requested data block;
the second pipeline stage, Tag_Wait: judges whether the request is a directory request, and reads the directory information;
the third pipeline stage, Tag_Judge: first judges whether the request hits; on a miss, it further judges whether the request is address-related to an entry in the MBUF: a related miss request is sent only to the MBUF, while an unrelated miss request is sent to both the MBUF and the OBUF; a hit request is processed according to whether it is a directory request: a non-directory request generates the enable to access the data body, while for a directory request the directory-entry information is checked, and the processing differs according to the directory state and the sharer list; the processing of directory requests falls into three classes: the first class operates directly and generates the data-body access enable; the second class waits for data to return from an L1D and generates the access enable only after the data arrives; the third class waits for Inv-Ack requests and generates the access enable only after all invalidation responses have arrived;
the fourth pipeline stage, Data_Acc: first judges the data-body processing class of the request; for a read operation, the requested data block is latched for one beat after being read from the data body; for a write operation, the write data is first encoded and then the data body is updated; finally, a request performing a data-body read sends the data read out to the next pipeline stage;
the fifth pipeline stage, Data_Dec: decodes the request data read by the previous stage, and sends the read-return-data request to the up-ring arbitration module for processing.
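The routing decision of the Tag_Judge stage can be summarized in a short Python sketch (function and label names are invented, and the three directory-request classes are collapsed into a single label here):

```python
def tag_judge(hit: bool, directory_req: bool, mbuf_related: bool) -> list:
    """Third-stage routing of the last-level Cache pipeline, as described above."""
    if not hit:
        # A miss address-related to an MBUF entry only waits in the MBUF;
        # an unrelated miss is also sent out through the OBUF toward the DDR.
        return ["MBUF"] if mbuf_related else ["MBUF", "OBUF"]
    if not directory_req:
        return ["DATA_ACC"]   # plain hit: generate the data-body access enable
    return ["DIR_CHECK"]      # directory hit: consult the state and sharer list
```

The sketch only captures where a request is routed; the three directory classes (direct access, wait for L1D data, wait for all Inv-Acks) would branch further inside the "DIR_CHECK" path.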
Compared with the prior art, the advantages of the present invention are as follows:
1. The multi-core directory coherence device for a GPDSP architecture of the invention is simple in principle, convenient to operate, highly flexible, and widely applicable; besides supporting coherence processing of L1D requests, it equally supports coherence processing of DMA requests, widening its field of application.
2. The multi-core directory coherence device for a GPDSP architecture of the invention employs a directory-controller mechanism, which can judge the correctness of the multi-core data coherence scheme at the protocol level, greatly shortening the design and verification period of the multi-core directory coherence device.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of the invention in a specific application example.
Fig. 2 shows the instruction-processing principles of the invention in a specific application example, where (a) is instruction-processing diagram (1), (b) is instruction-processing diagram (2), (c) is instruction-processing diagram (3), and (d) is the level-one data Cache replacement-processing diagram.
Fig. 3 shows the DMA-processing principles of the invention in a specific application example, where (a) is the DMA read-request processing diagram and (b) is the DMA write-request processing diagram.
Fig. 4 is the directory-structure diagram of one set of a last-level Cache directory bank of the invention in a specific application example.
Fig. 5 is the overall structural diagram of the L1D pipeline of the invention in a specific application example.
Fig. 6 is the overall structural diagram of the last-level Cache pipeline of the invention in a specific application example.
Specific embodiments
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In a specific application example, the GPDSP architecture includes N cores. A GetX request denotes a GetS request or a GetM request; a Fwd-GetX request denotes a Fwd-GetS request or a Fwd-GetM request; a PutX request denotes a PutS request or a PutM+data request; directory state X denotes directory state S or directory state M; directory state Y denotes directory state S or directory state I.
As shown in Fig. 1, the multi-core directory coherence device for a GPDSP architecture of the invention comprises:
a kernel, containing a DMA engine and a level-one data Cache (L1D). The DMA is used to complete data transfers between peripherals and between cores, and must be configured by the programmer before starting. The L1D contains two parallel processing units, Normal Deal and Monitor Deal. The Normal Deal processing unit completes the processing of load and store instructions; if an instruction misses in the L1D, replacement processing may also be required. The Monitor Deal processing unit responds to snoop requests arriving at any time, and its processing is not affected by the Normal Deal processing unit.
an on-chip last-level Cache, distributed across the on-chip interconnection network and divisible into N banks. Each bank consists of an input buffer unit (IBUF), a pipeline unit (PipeLine), an output buffer unit (OBUF), and a return-ring network processing logic unit (Rtn NAC). The input buffer unit is responsible for buffering requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit is responsible for buffering requests from the last-level Cache to the DDR; the return-ring network processing logic unit is responsible for arbitrating among the multiple types of requests entering the network-on-chip.
off-chip DDR storage, also called main memory, whose data can be cached in the L1D and the last-level Cache.
an on-chip interconnection network, which first receives a network request, decodes it, and after decoding the destination node and destination device sends the request to the corresponding location. In addition, the on-chip interconnection network also processes requests from the last-level Cache, on a similar principle. A network request here is any request other than a local request, a local request being one issued by the core corresponding to a given last-level Cache bank.
Table 1 below gives a detailed description of all directory requests; it contains all request types that may occur during directory operations, and serves to introduce the directory coherence mechanism.
Table 1
In the specific application, the invention further uses an extended MSI directory protocol. The basic directory protocol is described in detail first; the basic directory protocol maintains coherence only for the L1D.
After the L1D receives an instruction, it operates on the instruction at once. The processing differs according to whether the data block accessed by the instruction is cached in the L1D and whether its dirty bit is set. On a read hit, or a write hit to a dirty line, the L1D handles the instruction directly. In all other cases, the L1D sends a request to the last-level Cache to trigger the next step of the operation.
In the specific application, as shown in Fig. 2, the cases can be divided into four classes according to the complexity of the operation, shown in Figs. 2(a), (b), (c), and (d) respectively.
Fig. 2(a) is instruction-processing diagram (1). As the figure shows, the instruction operation completes in two steps, and can be divided into two cases according to the instruction class. After a read miss, the L1D generates a GetS request and sends it to the last-level Cache. On receiving the GetS request, the last-level Cache first checks the directory entry; since the directory state is I or S, the latest data is cached in the last-level Cache, so the last-level Cache reads out the data and returns it directly to the request source. After a write miss, the L1D generates a GetM request and sends it to the last-level Cache. On receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is I, the latest data is cached in the last-level Cache, so the last-level Cache reads out the data and returns it directly to the request source.
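The two-step cases of Fig. 2(a) can be sketched as follows (a simplified model with invented names; the cases that need forwarding or invalidation are lumped into one return value):

```python
def llc_on_get(req: str, dir_state: str) -> str:
    """Last-level Cache response to GetS/GetM when it already holds the latest data."""
    if req == "GetS" and dir_state in ("I", "S"):
        return "data_to_requestor"   # two-step case: the LLC replies directly
    if req == "GetM" and dir_state == "I":
        return "data_to_requestor"
    # Remaining cases: state M needs forwarding (Fig. 2(b));
    # a GetM under state S needs sharer invalidation (Fig. 2(c)).
    return "further_handling"
```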
Fig. 2(b) is instruction-processing diagram (2). As the figure shows, the instruction operation completes in three steps, and can be divided into two cases according to the instruction class. After a read miss, the L1D generates a GetS request and sends it to the last-level Cache. On receiving the GetS request, the last-level Cache first checks the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetS request to the L1D holding the latest copy. On receiving the Fwd-GetS request, that L1D sends read-return-data requests to both the last-level Cache and the request source, and changes its local state to S. After a write miss, the L1D generates a GetM request and sends it to the last-level Cache. On receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetM request to the L1D holding the latest copy. On receiving the Fwd-GetM request, that L1D sends a read-return-data request to the request source only, and changes its local state to I.
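The owner-side handling of the two forwarded requests in Fig. 2(b) can be sketched as below (hypothetical names; a minimal model of the reply targets and the state change):

```python
def owner_on_forward(req: str) -> dict:
    """An owning L1D (local copy dirty) receives a forwarded request."""
    if req == "Fwd-GetS":
        # Reply with data to both the LLC and the requestor; demote to S.
        return {"data_to": ["LLC", "requestor"], "new_state": "S"}
    if req == "Fwd-GetM":
        # Reply with data to the requestor only; invalidate the local copy.
        return {"data_to": ["requestor"], "new_state": "I"}
    raise ValueError("not a forwarded request: " + req)
```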
Fig. 2(c) is instruction-processing diagram (3). As the figure shows, the instruction operation completes in three steps, and can be divided into two cases according to whether the instruction hits and whether the data is dirty. After a write miss, the L1D generates a GetM request and sends it to the last-level Cache. On receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is S, the last-level Cache sends, according to the sharer-list information, an Inv-L request to every L1D holding a copy of the data, and at the same time reads the data and returns it to the requestor together with an Ack count. On receiving an Inv-L request, an L1D destroys its local data block and sends an Inv-Ack request to the requestor. The missing writer performs the write only after receiving both the data and all invalidation responses. A clean write hit in the L1D also generates a GetM request; since its handling is the same as the write-miss case, it is not discussed further here.
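The invalidation step of Fig. 2(c), a GetM hitting directory state S, can be sketched as follows (the function name and return format are invented):

```python
def llc_on_getm_shared(sharers: list, requestor: int) -> dict:
    """Send Inv-L to every sharer except the writer; return data plus an Ack count."""
    targets = [core for core, has_copy in enumerate(sharers)
               if has_copy and core != requestor]
    # The writer performs its store only after the data and all Inv-Acks arrive.
    return {"inv_l_to": targets, "ack_count": len(targets)}
```

With four cores where cores 0, 1, and 3 share the block and core 1 is the writer, Inv-L goes to cores 0 and 3 and the data return carries an Ack count of 2.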
Fig. 2(d) is the L1D replacement-processing diagram. As the figure shows, the operation completes in two steps, and can be divided into two cases according to whether the data is dirty. When replacing a clean line, the L1D sends a PutS request to the last-level Cache. On receiving the PutS request, the last-level Cache updates the sharer-list information; if the updated sharer list is all zeros, it also changes the directory state to I. After the operation completes, the last-level Cache sends a Put-Ack request to the requestor. When replacing a dirty line, the L1D sends a PutM+data request to the last-level Cache. On receiving the PutM+data request, the last-level Cache updates the data body and the directory information; after the operation completes, it sends a Put-Ack request to the requestor.
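The replacement handling of Fig. 2(d) can be sketched as below (illustrative names; the entry is a plain dict, and the directory state after a dirty write-back is assumed here to become I, which the text does not state explicitly):

```python
def llc_on_put(req: str, entry: dict, core: int) -> str:
    """Clean (PutS) and dirty (PutM+data) replacement at the last-level Cache."""
    entry["sharers"][core] = False           # the requesting core gives up its copy
    if req == "PutS":
        if not any(entry["sharers"]):        # sharer list now all zero
            entry["state"] = "I"
    elif req == "PutM+data":
        entry["data_updated"] = True         # write the dirty data back
        entry["state"] = "I"                 # assumed final state after write-back
    return "Put-Ack"                         # acknowledge the requestor
```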
In a DSP, the DMA must perform large data transfers between peripherals and between cores. If no coherence maintenance is applied to the DMA, multi-core data inconsistency will inevitably appear; if coherence for the DMA is maintained through a software-hardware cooperative synchronization mechanism, the programmer must monitor memory-space usage in real time, which poses no small challenge. The invention therefore extends the basic directory protocol so that it also supports coherence maintenance for DMA requests. The DMA accesses the last-level Cache directly through the interconnection network. According to the bank-access operation, DMA requests fall into two classes, shown in Fig. 3(a) and Fig. 3(b) respectively.
Fig. 3(a) is the DMA read-request processing diagram. On receiving a DMA read request, the last-level Cache checks the directory information and acts accordingly. When the directory state is I or S, the latest data is in the last-level Cache, so the data is read and returned directly to the DMA. When the directory state is M, the last-level Cache does not hold the latest data, so it sends a Fwd-Rd request to the L1D holding the latest copy; on receiving the Fwd-Rd request, that L1D sends read-return-data requests to both the last-level Cache and the DMA, and the directory state becomes S.
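The per-state DMA read handling of Fig. 3(a) can be sketched as (hypothetical names):

```python
def llc_on_dma_read(dir_state: str) -> str:
    """Last-level Cache handling of a DMA read, by directory state."""
    if dir_state in ("I", "S"):
        return "read_and_return_to_dma"      # the LLC already holds the latest data
    # M state: ask the owning L1D with Fwd-Rd; it replies to both the LLC
    # and the DMA, and the directory state then becomes S.
    return "send_Fwd-Rd_to_owner"
```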
Fig. 3(b) is the DMA write-request processing diagram. On receiving a DMA write request, the last-level Cache checks the directory information and acts accordingly. When the directory state is I, the latest data is in the last-level Cache, so the DMA write request updates the data body, and a response signal is returned to the DMA after the operation completes. When the directory state is M, the last-level Cache does not hold the latest data, so it sends a Fwd-Wrt request to the L1D holding the latest copy; on receiving the Fwd-Wrt request, that L1D sends a read-return-data request to the last-level Cache only and destroys its local data block. After the latest data returns, the last-level Cache merges it with the data carried by the DMA write request and then updates the data body and the directory information; after the operation completes, the last-level Cache returns a response signal to the DMA. When the directory state is S, the last-level Cache sends, according to the sharer-list information, an Inv-DE request to every L1D holding a copy of the data; on receiving the Inv-DE request, each such L1D destroys its local data block and sends an Inv-Ack request to the last-level Cache. Only after receiving all invalidation responses does the last-level Cache perform the write, returning a response signal to the DMA after the operation completes.
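The three per-state cases of the DMA write in Fig. 3(b) can likewise be sketched (names invented):

```python
def llc_on_dma_write(dir_state: str, sharers: list) -> dict:
    """Last-level Cache handling of a DMA write, by directory state."""
    if dir_state == "I":
        return {"action": "update_data_then_ack_dma"}
    if dir_state == "M":
        # Pull the dirty line back with Fwd-Wrt, merge it with the DMA data,
        # update the data body and directory, then ack the DMA.
        return {"action": "Fwd-Wrt_then_merge_then_ack"}
    # S state: write-invalidate every sharer first (Inv-DE), and perform
    # the write only after all Inv-Acks have returned.
    n = sum(1 for has_copy in sharers if has_copy)
    return {"action": "Inv-DE_then_write", "wait_acks": n}
```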
Directory operations are atomic, but many conflict situations appear during design and implementation. For example, from the chip-wide perspective, a conflict arises when a later related directory request arrives before an earlier directory request has finished processing. The invention resolves these conflict situations with a directory-controller mechanism. Directory controllers are divided into two kinds according to their location: the L1D directory controller and the last-level Cache directory controller.
Tables 2.1, 2.2, and 2.3 below give a detailed description of the L1D directory controller. As the tables show, besides the three stable states M, S, and I, the directory state of a data block also passes through many "transient" states. For example, from an L1D read miss until the missing data returns, the accessed data block stays in state IS^D. A data block in a transient state can respond to some related snoop requests; a snoop request that cannot be responded to can only be stalled. This both guarantees data coherence and improves system performance. For example, after an L1D dirty replacement, the directory state of the line being replaced becomes MI^A. If, before the corresponding response signal from the last-level Cache arrives, this L1D receives a related Fwd-Rd request, it can respond at once and, on completing the operation, change the directory state to SI^A.
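The transient-state behavior just described (answer the snoops that can be answered, stall the rest) can be illustrated with a tiny table-driven sketch; only the MI^A example from the text is filled in, and everything else stalls:

```python
def transient_snoop(state: str, snoop: str):
    """A block in a transient state answers the snoops it can, stalls the rest."""
    # Example from the text: after a dirty replacement the line sits in MI^A;
    # a related Fwd-Rd is answered at once and the state moves to SI^A.
    table = {("MI_A", "Fwd-Rd"): ("respond_with_data", "SI_A")}
    if (state, snoop) in table:
        return table[(state, snoop)]
    return ("stall", state)                  # unanswerable snoops simply wait
```

A full controller would populate the table from Tables 2.1 to 2.3; the dict shown here is a one-row placeholder, not the patent's complete transition table.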
Table 2.1
Table 2.2
Table 2.3
In the directory protocol designed by the invention, the L1D full-line write hit is somewhat special. Normally a hitting request has no need to fetch data from the next Cache level; to simplify the complexity of the directory protocol, this case is classified together with the L1D write miss.
The invalidation requests received by an L1D are of two kinds: Inv-L requests and Inv-DE requests. Because the directory protocol designed by the invention adds coherence maintenance for the DMA, and operates in a write-invalidate manner, when a DMA write request operates on the last-level Cache, each L1D holding a data copy that receives an Inv-DE request should return its response signal to the last-level Cache. An L1D executing a store instruction may also trigger the write-invalidate operation (sending Inv-L requests), in which case the invalidation response signals return to that L1D rather than to the last-level Cache. Since the device to which the response signal returns differs by invalidation-request type, the invention distinguishes the two.
An invalidation-response (Inv-Ack) request arriving at an L1D is handled according to circumstances. If it is not the last invalidation response, the directory state of the corresponding data block does not change; otherwise, the directory state of the corresponding data block changes from IM^A or SM^A to M.
A read-return-data request arriving at an L1D has two possible sources: the last-level Cache or another L1D. Since some read-return-data requests must carry Ack information (recording the number of invalidation responses still to arrive), the cases must be distinguished. As shown in the first row of Table 2.3, read-return-data requests divide into five cases: from an owner L1D without Ack; from an owner L1D with Ack; from the last-level Cache without Ack; from the last-level Cache with Ack equal to 0; and from the last-level Cache with Ack greater than 0.
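The five-way classification of read-return-data requests in Table 2.3 can be sketched as follows (the labels are invented; `ack=None` models a request that carries no Ack field at all):

```python
def classify_data_return(source: str, ack=None) -> str:
    """Distinguish the five read-return cases listed in Table 2.3."""
    if source == "owner_L1D":
        return "owner_no_ack" if ack is None else "owner_with_ack"
    # Otherwise the source is the last-level Cache.
    if ack is None:
        return "llc_no_ack"
    return "llc_ack_zero" if ack == 0 else "llc_ack_positive"
```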
Tables 3.1, 3.2, and 3.3 below give a detailed description of the last-level Cache directory controller. Like the L1D directory controller, the last-level Cache directory controller also has "transient" states. For example, after a read miss the L1D sends a GetS request to the last-level Cache; if the accessed data block is in state M in the last-level Cache, its directory state becomes S^D until the latest data returns. Invalidation-response (Inv-Ack) handling is similar to that of the L1D directory controller and is not discussed further here.
Table 3.1
Table 3.2
Table 3.3
L1D replacement is divided into two kinds: clean-line replacement and dirty-line replacement.
A clean-line replacement sends a PutS request to the last-level Cache. In actual operation multiple cores may share the same data block, so PutS requests are divided, according to their order of arrival at the last-level Cache, into two cases: not-the-last (PutS-NotLast) and the-last (PutS-Last). When a PutS-NotLast request is processed, the directory state of the data block does not change; only the corresponding sharer-list information is updated. When a PutS-Last request is processed, the directory state of the data block changes from S to I and the corresponding sharer list is cleared.
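The two PutS cases can be sketched as follows; the dictionary layout and function name are assumptions made for illustration, not the patent's implementation:

```python
def handle_puts(directory, is_last, core_id):
    """Sketch of PutS handling at the last-level Cache (assumed names).
    PutS-NotLast only clears the departing core's sharer entry;
    PutS-Last additionally sets the directory state to I."""
    directory["sharers"].discard(core_id)
    if is_last:
        directory["state"] = "I"
        directory["sharers"].clear()

d = {"state": "S", "sharers": {0, 1}}
handle_puts(d, is_last=False, core_id=0)   # PutS-NotLast: state stays S
assert d["state"] == "S" and d["sharers"] == {1}
handle_puts(d, is_last=True, core_id=1)    # PutS-Last: S -> I, list cleared
assert d["state"] == "I" and not d["sharers"]
```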
A dirty-line replacement sends a PutM+data request to the last-level Cache. The directory entry of the accessed data block may have changed before the request is processed, so the cases are handled differently. If the directory state of the data block accessed by the dirty-line replacement request is M and the sharer list indicates exactly the L1D performing this dirty-line replacement, the request is called a PutM+data-from-Owner request; in this case the dirty replacement data updates the last-level Cache data array and an acknowledgment is returned. Otherwise the dirty-line replacement request is called a PutM+data-from-Non-Owner request, and only an acknowledgment need be returned.
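The Owner / Non-Owner distinction above can be sketched like this; the field names (`owner`, `addr`) and the return convention are hypothetical, introduced only to make the two cases concrete:

```python
def handle_putm_data(directory, requester, data, llc_data):
    """Sketch of the PutM+data distinction described above (illustrative).
    Returns True when the LLC data array was updated (from-Owner case);
    in both cases an acknowledgment is returned to the requester."""
    from_owner = directory["state"] == "M" and directory["owner"] == requester
    if from_owner:
        llc_data[directory["addr"]] = data   # update LLC data array
    return from_owner

llc = {}
d = {"state": "M", "owner": 2, "addr": 0x40}
assert handle_putm_data(d, requester=2, data=b"dirty", llc_data=llc) is True
assert llc[0x40] == b"dirty"
# a stale PutM+data from a non-owner only gets an acknowledgment
assert handle_putm_data(d, requester=3, data=b"stale", llc_data=llc) is False
assert llc[0x40] == b"dirty"
```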
The present invention builds its directory coherence mechanism on the last-level Cache; the directory structure of one set in a last-level Cache directory bank is shown in Figure 4. As can be seen from the figure, the last-level Cache uses an 8-way set-associative mapping and allocates one directory entry per way. A directory entry consists of two parts: the directory state and the sharer list. The directory state indicates whether the cached data block holds the latest data in the last-level Cache and whether it is dirty; the sharer list records the copies of the cached data block in the first-level storage. Combining the directory state and the sharer-list information reveals the exact on-chip situation of a cached data block, which facilitates its coherence maintenance.
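A bit-level sketch of such a directory entry follows. The patent fixes only the two parts (directory state and sharer list); the field widths and encodings below are assumptions for illustration:

```python
# Hypothetical packing of one directory entry: a state code above a
# per-core sharer bit vector (one bit per DSP Core, 8 cores assumed).
NUM_CORES = 8
STATE_S = 0b01          # assumed encoding; enough bits for M/S/I (+ transients)

def pack_entry(state_code, sharer_bits):
    """Concatenate state code and sharer bit vector into one entry word."""
    return (state_code << NUM_CORES) | sharer_bits

def sharers(entry):
    """List the core IDs whose sharer bit is set in the entry."""
    return [c for c in range(NUM_CORES) if (entry >> c) & 1]

e = pack_entry(STATE_S, 0b00000101)   # state S, cores 0 and 2 hold copies
assert sharers(e) == [0, 2]
assert (e >> NUM_CORES) == STATE_S
```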
The realization of a directory mechanism varies with the pipeline in which it is embedded; the present invention briefly describes the pipeline realizations in the L1D and the last-level Cache as examples.
Figure 5 shows the overall structure of the L1D pipeline in a specific application example. The pipeline consists of seven stages: DC1, DC2, EX1, EX2, EX3, EX4 and EX5. After the L1D receives a load or store instruction, two pipelined decoding stages (DC1 and DC2) first determine the operation type and the function to be performed. Address calculation follows decoding (EX1). The function is then carried out: the present invention designs an L1D Cache memory-access pipeline in which reads take three cycles and writes take two. The read pipeline occupies positions EX2, EX3 and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss processing, and access output respectively; the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss processing. As part of the scalar memory-access main pipeline, the L1D Cache pipeline is also controlled by the global core stall signal (Stall) and the pipeline-clear signal.
Figure 6 shows the overall structure of the last-level Cache pipeline in a specific application example. As can be seen from the figure, a request entering the last-level Cache can take one of two paths through the pipeline. Read-return requests from the L1D and invalidation acknowledgments returned by the L1D are not held in the input buffer but are bypassed directly to the Tag_Judge stage; this is the first path. Requests accessing the DDR memory space must first be held in the input buffer and then enter the pipeline from the first stage; this is the second path. The pipeline consists of five stages: Req_Arb, Tag_Wait, Tag_Judge, Data_Acc and Data_Dec.
The function realized by each pipeline stage is described in detail below.
Pipeline stage one (Req_Arb): at this stage, requests and Flush requests undergo round-robin arbitration, and the winning request is sent to the next stage. Meanwhile the valid bit, dirty bit and Tag information of the requested data block are read.
Pipeline stage two (Tag_Wait): at this stage the request is merely checked for being a directory request, and the directory information is read.
Pipeline stage three (Tag_Judge): the request is first checked for a hit; on a miss it is further checked for an address match in the MBUF. A dependent miss request is sent to the MBUF; an independent miss request is sent to both the MBUF and the OBUF. A hit request is processed according to whether it is a directory request. A non-directory request generates a data-array access enable. For a directory request the directory entry is examined, and the handling differs with the directory state and sharer list. Directory-request handling falls into three classes: the first class operates directly and generates a data-array access enable; the second class, because the latest data is not in the last-level Cache, must wait for data returned by an L1D and generates the data-array access enable only after the data arrives; the third class must wait for Inv-Ack requests and generates the data-array access enable only after all invalidation acknowledgments have arrived.
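The Tag_Judge decision tree above can be sketched as a dispatch function; all names (buffers, class labels) are illustrative placeholders, not the patent's signal names:

```python
def tag_judge(req, hit, mbuf_match, directory_class=None):
    """Sketch of the Tag_Judge stage decisions described above.
    Returns the destination(s) the request is routed to next."""
    if not hit:
        # dependent miss -> MBUF only; independent miss -> MBUF and OBUF
        return ["MBUF"] if mbuf_match else ["MBUF", "OBUF"]
    if not req["is_directory"]:
        return ["data_array"]                 # access enable generated directly
    return {                                  # three classes of directory request
        "direct": ["data_array"],             # operate directly
        "wait_data": ["wait_L1D_data"],       # latest data not in the LLC
        "wait_acks": ["wait_inv_acks"],       # wait for all Inv-Acks first
    }[directory_class]

assert tag_judge({"is_directory": False}, hit=True, mbuf_match=False) == ["data_array"]
assert tag_judge({"is_directory": True}, hit=True, mbuf_match=False,
                 directory_class="wait_acks") == ["wait_inv_acks"]
assert tag_judge({}, hit=False, mbuf_match=True) == ["MBUF"]
```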
Pipeline stage four (Data_Acc): the class of data-array processing is determined first. For a read, the accessed data block is read from the data array and latched for one cycle; for a write, the data to be written is first encoded and the data array is then updated. A request performing a data-array read finally sends the read data to the next stage.
Pipeline stage five (Data_Dec): the request data read by the previous stage is decoded, and the read-return data request is sent to the ring arbitration module for processing.
The above are only preferred embodiments of the present invention; the protection scope of the present invention is not limited to the above embodiments, and all technical solutions falling under the concept of the present invention belong to its protection scope. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications that do not depart from the principles of the present invention shall also be regarded as within the protection scope of the present invention.
Claims (8)
1. A multi-core directory coherence device for a GPDSP architecture, characterized by comprising:
a kernel comprising a DMA and an L1D, the L1D being a level-one data Cache; the DMA is used to transfer data between peripherals and between cores; the L1D comprises two parallel processing units, Normal Deal and Monitor Deal; the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time, its processing being unaffected by the Normal Deal processing unit;
an on-chip last-level Cache connected in distributed fashion to the on-chip interconnection network;
an off-chip memory DDR, whose data is cached in the L1D and the on-chip last-level Cache;
the on-chip interconnection network, for receiving network requests: after receiving a network request it first decodes the request and, having decoded the destination node and destination device, forwards the request to the corresponding location;
the on-chip last-level Cache is divided into several banks, each bank consisting of an input buffer unit IBUF, a pipeline unit PipeLine, an output buffer unit OBUF and a return-network processing logic unit Rtn NAC; the input buffer unit caches requests entering the last-level Cache from the network-on-chip; the pipeline unit pipelines the requests from the input buffer that access the DDR memory space; the output buffer unit caches the requests by which the last-level Cache accesses the DDR; the return-network processing logic unit arbitrates among the various types of requests entering the network-on-chip and then dispatches them.
2. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized by further comprising an MSI directory protocol unit for maintaining coherence of the requests issued by the L1D; the MSI directory protocol unit consists of three directory states M, S and I; state M indicates that the data is exclusively owned by some DSP Core and is dirty; state S indicates that the data is shared by one or more DSP Cores and is clean; state I indicates that no DSP Core holds a copy of the data.
3. The multi-core directory coherence device for a GPDSP architecture according to claim 2, characterized by further comprising directory controllers, which judge the correctness of the scheme at the protocol level and complete the different processing of a single request under different directory states, the conflict handling of multiple associated requests, and the response processing of related requests while a data block is in a directory "intermediate state"; the directory controllers fall into two classes, one placed in the L1D and the other in the on-chip last-level Cache, and directory operations are performed in the L1D and the on-chip last-level Cache.
4. The multi-core directory coherence device for a GPDSP architecture according to claim 2, characterized in that the on-chip last-level Cache stores a full-directory structure, which allocates a directory entry for every data block cached in the on-chip last-level Cache; the directory entry comprises two parts, a directory state and a sharer list, and the sharer list allocates one bit per DSP Core to indicate whether the data has a copy in the corresponding DSP Core.
5. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized in that a pipeline structure is used in the L1D to complete the pipelined processing of instruction decoding, address calculation, tag and status-bit reading, hit judgment, data-array access, and data return.
6. The multi-core directory coherence device for a GPDSP architecture according to claim 5, characterized in that the pipeline structure of the L1D consists of the stages DC1, DC2, EX1, EX2, EX3, EX4 and EX5; after the L1D receives a load or store instruction, two pipelined decoding stages, namely DC1 and DC2, first determine the operation type and the function to be performed; address calculation follows decoding, namely EX1, to complete function realization; the L1D Cache memory-access pipeline, in which reads take three cycles and writes take two, is then completed, wherein the read pipeline occupies positions EX2, EX3 and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss processing and access output respectively, and the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss processing.
7. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized in that a pipeline structure is used in the on-chip last-level Cache to realize the pipelined processing of tag and status-bit reading, directory-entry reading, hit judgment, snoop processing, data-array access, and data return.
8. The multi-core directory coherence device for a GPDSP architecture according to claim 7, characterized in that the pipeline structure of the on-chip last-level Cache comprises:
pipeline stage one, Req_Arb: requests and Flush requests undergo round-robin arbitration and the winning request is sent to the next stage; meanwhile the valid bit, dirty bit and Tag information of the requested data block are read;
pipeline stage two, Tag_Wait: the request is checked for being a directory request, and the directory information is read;
pipeline stage three, Tag_Judge: the request is first checked for a hit; on a miss it is further checked for an address match in the MBUF; a dependent miss request is sent to the MBUF; an independent miss request is sent to both the MBUF and the OBUF; a hit request is processed according to whether it is a directory request; a non-directory request generates a data-array access enable; for a directory request the directory entry is examined, and the handling differs with the directory state and sharer list; directory-request handling falls into three classes: the first class operates directly and generates a data-array access enable; the second class waits for data returned by an L1D and generates the access enable only after the data arrives; the third class waits for Inv-Ack requests and generates the access enable only after all invalidation acknowledgments have arrived;
pipeline stage four, Data_Acc: the class of data-array processing is determined first; for a read, the accessed data block is read from the data array and latched for one cycle; for a write, the data to be written is first encoded and the data array is then updated; a request performing a data-array read finally sends the read data to the next stage;
pipeline stage five, Data_Dec: the request data read by the previous stage is decoded, and the read-return data request is sent to the ring arbitration module for processing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610503703.5A CN106201939B (en) | 2016-06-30 | 2016-06-30 | Multicore catalogue consistency device towards GPDSP framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106201939A CN106201939A (en) | 2016-12-07 |
CN106201939B true CN106201939B (en) | 2019-04-05 |
Family
ID=57463707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610503703.5A Active CN106201939B (en) | 2016-06-30 | 2016-06-30 | Multicore catalogue consistency device towards GPDSP framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106201939B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109117396A (en) * | 2018-08-30 | 2019-01-01 | 山东经安纬固消防科技有限公司 | memory access method and system |
CN110704343B (en) * | 2019-09-10 | 2021-01-05 | 无锡江南计算技术研究所 | Data transmission method and device for memory access and on-chip communication of many-core processor |
CN113435153B (en) * | 2021-06-04 | 2022-07-22 | 上海天数智芯半导体有限公司 | Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems |
CN116028418B (en) * | 2023-02-13 | 2023-06-20 | 中国人民解放军国防科技大学 | GPDSP-based extensible multi-core processor, acceleration card and computer |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103279428A (en) * | 2013-05-08 | 2013-09-04 | 中国人民解放军国防科学技术大学 | Explicit multi-core Cache consistency active management method facing flow application |
CN103714039A (en) * | 2013-12-25 | 2014-04-09 | 中国人民解放军国防科学技术大学 | Universal computing digital signal processor |
CN104679689A (en) * | 2015-01-22 | 2015-06-03 | 中国人民解放军国防科学技术大学 | Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting |
CN104699631A (en) * | 2015-03-26 | 2015-06-10 | 中国人民解放军国防科学技术大学 | Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor) |
CN105389277A (en) * | 2015-10-29 | 2016-03-09 | 中国人民解放军国防科学技术大学 | Scientific computation-oriented high performance DMA (Direct Memory Access) part in GPDSP (General-Purpose Digital Signal Processor) |
CN105718242A (en) * | 2016-01-15 | 2016-06-29 | 中国人民解放军国防科学技术大学 | Processing method and system for supporting software and hardware data consistency in multi-core DSP (Digital Signal Processing) |
Non-Patent Citations (1)
Title |
---|
"X-DSP一级数据Cache的设计与实现";李明;《中国优秀硕士学位论文全文数据库信息科技辑》;20141115;第I137-29页,正文第1-7页第1.1-1.2节,第9-23页第2.1-2.6节 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6631448B2 (en) | Cache coherence unit for interconnecting multiprocessor nodes having pipelined snoopy protocol | |
US6636949B2 (en) | System for handling coherence protocol races in a scalable shared memory system based on chip multiprocessing | |
US6697919B2 (en) | System and method for limited fanout daisy chaining of cache invalidation requests in a shared-memory multiprocessor system | |
US6640287B2 (en) | Scalable multiprocessor system and cache coherence method incorporating invalid-to-dirty requests | |
JP3927556B2 (en) | Multiprocessor data processing system, method for handling translation index buffer invalidation instruction (TLBI), and processor | |
US8180981B2 (en) | Cache coherent support for flash in a memory hierarchy | |
US6738868B2 (en) | System for minimizing directory information in scalable multiprocessor systems with logically independent input/output nodes | |
US9740617B2 (en) | Hardware apparatuses and methods to control cache line coherence | |
CN106201939B (en) | Multicore catalogue consistency device towards GPDSP framework | |
US9361233B2 (en) | Method and apparatus for shared line unified cache | |
US20170185515A1 (en) | Cpu remote snoop filtering mechanism for field programmable gate array | |
CN104699631A (en) | Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor) | |
Chiou et al. | StarT-NG: Delivering seamless parallel computing | |
EP1153349A1 (en) | Non-uniform memory access (numa) data processing system that speculatively forwards a read request to a remote processing node | |
CN110647404A (en) | System, apparatus and method for barrier synchronization in a multithreaded processor | |
US10073782B2 (en) | Memory unit for data memory references of multi-threaded processor with interleaved inter-thread pipeline in emulated shared memory architectures | |
Thakkar et al. | The balance multiprocessor system | |
US20060224840A1 (en) | Method and apparatus for filtering snoop requests using a scoreboard | |
CN109661656A (en) | Method and apparatus for the intelligent storage operation using the request of condition ownership | |
CN103019655B (en) | Towards memory copying accelerated method and the device of multi-core microprocessor | |
US20070073977A1 (en) | Early global observation point for a uniprocessor system | |
WO2017172220A1 (en) | Method, system, and apparatus for a coherency task list to minimize cache snooping between cpu and fpga | |
Gao et al. | System architecture of Godson-3 multi-core processors | |
US9436605B2 (en) | Cache coherency apparatus and method minimizing memory writeback operations | |
US11163682B2 (en) | Systems, methods, and apparatuses for distributed consistency memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||