CN106201939B - Multi-core directory coherence device for a GPDSP architecture - Google Patents

Multi-core directory coherence device for a GPDSP architecture

Info

Publication number
CN106201939B
CN106201939B (application CN201610503703.5A)
Authority
CN
China
Prior art keywords
request
data
last-level
processing
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610503703.5A
Other languages
Chinese (zh)
Other versions
CN106201939A (en)
Inventor
刘胜
李昭然
陈海燕
许邦建
鲁建壮
陈俊杰
孔宪停
康子扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201610503703.5A priority Critical patent/CN106201939B/en
Publication of CN106201939A publication Critical patent/CN106201939A/en
Application granted granted Critical
Publication of CN106201939B publication Critical patent/CN106201939B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F13/30Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal with priority control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/28DMA
    • G06F2213/2806Space or buffer allocation for DMA transfers

Abstract

A multi-core directory coherence device for a GPDSP architecture, comprising: a kernel, containing a DMA and an L1D, where L1D is the level-one data Cache; the DMA completes data transfers between peripherals and between cores; the L1D contains two parallel processing units, Normal Deal and Monitor Deal, where the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal processing unit; an on-chip last-level Cache, distributed across and connected to the on-chip interconnection network; an off-chip DDR memory, whose data is cached in the L1D and the on-chip last-level Cache; and an on-chip interconnection network that receives network requests, decodes each request after receiving it, and forwards the request to the appropriate destination once the target node and target device have been decoded. The present invention has the advantages of simple principle, convenient operation, high flexibility, and wide applicability.

Description

Multi-core directory coherence device for a GPDSP architecture
Technical field
The present invention relates generally to the field of processor microarchitecture design, and in particular to a multi-core memory-access design suitable for a General-Purpose Digital Signal Processor (GPDSP).
Background technique
As one of the three leading families in the processor field (CPU, DSP, and GPU), the DSP is widely used in embedded systems thanks to its high performance per watt, good programmability, and low power consumption. Unlike a CPU, a DSP has the following characteristics: 1) strong computing capability, with an emphasis on efficient computation rather than on control and transaction processing; 2) dedicated hardware support for typical signal-processing operations, such as multiply-accumulate and circular addressing; 3) the common features of embedded microprocessors, such as address and instruction paths of no more than 32 bits, imprecise interrupts, and a short-term offline debugging, long-term online resident operation programming mode (rather than the debug-then-run mode of general-purpose CPUs); 4) peripheral integration centered on high-speed interfaces, which is particularly beneficial for online transmission and reception of AD/DA signals and also supports high-speed direct links between DSPs.
Given the high performance per watt and powerful computing capability of the DSP, how to upgrade and improve the traditional DSP architecture to make it suitable for high-performance computing is a current research hotspot both at home and abroad.
For example, a Chinese patent application (general-purpose computing digital signal processor, application number: 201310725118.6) proposes a multi-core microprocessor, the GPDSP, that both retains the essential features and the high-performance, low-power advantage of the DSP and can efficiently support general scientific computing. The structure has the following features: 1) direct support for double-precision floating-point and 64-bit fixed-point data, with general registers, data buses, and instruction widths of 64 bits or more and an address-bus width of 40 bits or more; 2) a tightly coupled CPU/DSP heterogeneous multi-core organization, with the CPU core supporting a full operating system and the scalar unit of the DSP core supporting an operating-system microkernel; 3) a unified programming model spanning the CPU core, the DSP core, and the vector array structure within the DSP core; 4) retention of host-machine cross debugging while also providing a local CPU-hosted debugging mode; 5) retention of the essential features of a common DSP apart from bit width.
As another example, a Chinese patent application (storage device and access method for multi-level cooperation and sharing in GPDSP, application number: 20150135194.0) targets the application demands of the GPDSP and proposes a multi-level cooperative storage architecture. Each DSP core contains globally addressable, programmer-visible local large-capacity scalar and vector array memories; multiple DSP cores share a large-capacity on-chip global Cache through the network-on-chip; and the direct memory access controller (DMA) and the network-on-chip provide high-bandwidth data movement between cores, core-internal memories, and the global Cache, realizing cooperative and shared data access within a single core and among multiple cores.
In the GPDSP, the in-core on-chip memories and the background DMA transfer mechanism are retained, which makes data coherence difficult to implement. A Chinese patent application (explicit multi-core Cache coherence active management method for streaming applications, application number: 201310166383.5) adopts a software-programmable data-coherence management method that hands the work of maintaining data coherence to the programmer, who must actively manage the correctness of producer-consumer processes. If a hardware-managed coherence protocol is used in the GPDSP instead, the programming burden on the programmer is reduced, thereby improving the versatility of the GPDSP.
Summary of the invention
In view of the problems of the prior art, the technical problem to be solved by the present invention is to provide a multi-core directory coherence device for a GPDSP architecture that is simple in principle, convenient to operate, highly flexible, and widely applicable.
To solve the above technical problem, the present invention adopts the following technical solution:
A multi-core directory coherence device for a GPDSP architecture, comprising:
a kernel, containing a DMA and an L1D, where L1D is the level-one data Cache; the DMA completes data transfers between peripherals and between cores; the L1D contains two parallel processing units, Normal Deal and Monitor Deal; the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal processing unit;
an on-chip last-level Cache, distributed across and connected to the on-chip interconnection network;
an off-chip DDR memory, whose data can be cached in the L1D and the on-chip last-level Cache;
an on-chip interconnection network that receives network requests, decodes each request after receiving it, and forwards the request to the appropriate destination once the target node and target device have been decoded.
As a further improvement of the present invention: the on-chip last-level Cache is divided into several banks, each bank consisting of an input buffer unit IBUF, a pipeline unit PipeLine, an output buffer unit OBUF, and a return-ring processing logic unit Rtn NAC; the input buffer unit caches requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit caches requests from the last-level Cache to the DDR; and the return-ring processing logic unit arbitrates among the various types of requests entering the network-on-chip.
As a further improvement of the present invention: the device further includes an MSI directory protocol unit for maintaining coherence of the requests issued by the L1D; the MSI directory protocol unit is built on three directory states, M, S, and I; state M indicates that the data is exclusively owned by some DSP core and is dirty; state S indicates that the data is shared by one or more DSP cores and is clean; state I indicates that no DSP core holds a copy of the data.
As a further improvement of the present invention: the device further includes directory controllers, which judge the correctness of the scheme at the protocol level and are used to complete the processing of a single request under different directory states, the conflict handling of multiple related requests, and the responses of data blocks in directory "intermediate" (transient) states to related requests; the directory controllers are of two kinds, one placed in the L1D and the other placed in the on-chip last-level Cache, and directory operations are performed in both the L1D and the on-chip last-level Cache.
As a further improvement of the present invention: the on-chip last-level Cache stores a full-directory structure that allocates a directory entry for every data block cached in the on-chip last-level Cache; the directory entry consists of two parts, a directory state and a sharer list, where the sharer list allocates one bit per DSP core to indicate whether the data has a copy in the corresponding DSP core.
As a further improvement of the present invention: the L1D uses a pipeline structure to complete the pipelined flow of instruction decoding, address calculation, tag and status-bit reading, hit judgment, data-body access, and data return.
As a further improvement of the present invention: the pipeline stages of the L1D are DC1, DC2, EX1, EX2, EX3, EX4, and EX5; after the L1D receives a load or store instruction, it first undergoes a two-stage pipelined decode, DC1 and DC2, which determines the instruction's operation type and the function to be realized; address calculation, namely EX1, is performed after instruction decoding completes, and the function realization follows; the L1D Cache memory-access pipeline then completes reads in 3 beats and writes in 2 beats, where the read pipeline occupies positions EX2, EX3, and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss handling, and access output respectively, and the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss handling respectively.
As a further improvement of the present invention: the on-chip last-level Cache uses a pipeline structure to realize the pipelined flow of tag and status-bit reading, directory-entry reading, hit judgment, snoop handling, data-body access, and data return.
As a further improvement of the present invention: the pipeline structure of the on-chip last-level Cache includes:
pipeline first stage Req_Arb: requests and Flush requests undergo round-robin arbitration, and the winning request is sent to the next stage; at the same time, the valid bit, dirty bit, and Tag information of the requested data block are read;
pipeline second stage Tag_Wait: judge whether the request is a directory request, and read the directory information;
pipeline third stage Tag_Judge: first judge whether the request hits; on a miss, further judge whether it is address-related to an entry in the MBUF; a related miss request is sent to the MBUF; an unrelated miss request is sent to both the MBUF and the OBUF; hitting requests are processed according to whether they are directory requests; a non-directory request generates a data-body access enable; for a directory request, the directory entry is checked, and the handling differs with the directory state and sharer list; the handling of directory requests falls into three classes: the first class operates directly and generates the data-body access enable; the second class waits for L1D data to return, generating the access enable only after the data arrives; the third class waits for Inv-Ack requests, generating the access enable only after all invalidation acknowledgments have arrived;
pipeline fourth stage Data_Acc: first judge the request's data-body handling class; for a read operation, the requested data block is latched for one beat after being read from the data body; for a write operation, the write data is first encoded and the data body is then updated; finally, a request performing a data-body read sends the read data to the next stage;
pipeline fifth stage Data_Dec: the request data read by the previous stage is decoded, and the read-return-data request is sent to the ring arbitration processing module for handling.
Compared with the prior art, the present invention has the following advantages:
1. The multi-core directory coherence device for a GPDSP architecture of the present invention is simple in principle, convenient to operate, highly flexible, and widely applicable; besides supporting coherence handling for L1D requests, it equally supports coherence handling for DMA requests, broadening the field of application.
2. The multi-core directory coherence device for a GPDSP architecture of the present invention uses directory controllers; with the directory-controller mechanism, the correctness of the multi-core data-coherence scheme can be judged at the protocol level, greatly shortening the design and verification period of the multi-core directory coherence device.
Detailed description of the invention
Fig. 1 is a schematic diagram of the principle of the present invention in a specific application example.
Fig. 2 is a schematic diagram of instruction handling by the present invention in a specific application example; (a) is instruction-handling diagram (1), (b) is instruction-handling diagram (2), (c) is instruction-handling diagram (3), and (d) is the level-one data Cache replacement-handling diagram.
Fig. 3 is a schematic diagram of DMA handling by the present invention in a specific application example; (a) is the DMA read-request handling diagram and (b) is the DMA write-request handling diagram.
Fig. 4 is the directory structure of one set in a last-level-Cache directory bank in a specific application example.
Fig. 5 is the overall structure of the L1D pipeline of the present invention in a specific application example.
Fig. 6 is the overall structure of the last-level-Cache pipeline of the present invention in a specific application example.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In the specific application example, the GPDSP architecture contains N cores. A GetX request denotes a GetS or GetM request; a Fwd-GetX request denotes a Fwd-GetS or Fwd-GetM request; a PutX request denotes a PutS or PutM+data request; directory state X denotes directory state S or M; and directory state Y denotes directory state S or I.
As shown in Fig. 1, the multi-core directory coherence device for a GPDSP architecture of the present invention comprises:
a kernel, containing a DMA and a level-one data Cache (L1D). The DMA completes data transfers between peripherals and between cores and must be configured by the programmer before it is started. The L1D contains two parallel processing units, Normal Deal and Monitor Deal. The Normal Deal processing unit completes the processing of load and store instructions; if an instruction misses in the L1D, it may also perform replacement processing. The Monitor Deal processing unit responds to snoop requests arriving at any time, and its processing is not affected by the Normal Deal processing unit.
an on-chip last-level Cache, distributed across the on-chip interconnection network and divisible into N banks. Each bank consists of an input buffer unit (IBUF), a pipeline unit (PipeLine), an output buffer unit (OBUF), and a return-ring processing logic unit (Rtn NAC). The input buffer unit caches requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit caches requests from the last-level Cache to the DDR; and the return-ring processing logic unit arbitrates among the various types of requests entering the network-on-chip.
an off-chip DDR memory, also called main memory, whose data can be cached in the L1D and the last-level Cache.
an on-chip interconnection network, which receives network requests, decodes each request upon arrival, and forwards it to the appropriate destination once the target node and target device have been decoded. In addition, the on-chip interconnection network handles requests from the last-level Cache in a similar way. Here, a network request is any request other than a local request, where a local request is one issued by the core corresponding to a given last-level-Cache bank.
Table 1 below describes all directory requests in detail; it covers every request type that may occur during directory operation, which facilitates introducing the directory coherence mechanism.
Table 1
In the specific application process, the present invention further uses an extended MSI directory protocol. The basic directory protocol is described in detail first; the basic directory protocol maintains coherence only for the L1D.
After the L1D receives an instruction, the instruction is operated on immediately. The processing differs according to whether the data block accessed by the instruction is cached in the L1D and according to its dirty-bit information. On an L1D read hit, or on a write hit to a dirty line, the instruction is handled directly. In all other cases, the L1D sends a request to the last-level Cache to trigger the next step of the operation.
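As a rough illustration of this dispatch rule, the following C sketch (not taken from the patent; the type and function names are illustrative assumptions) classifies an L1D access into the outcomes just described:

    #include <stdbool.h>

    /* Minimal model of the coherence-relevant bits of an L1D line. */
    typedef struct {
        bool valid;
        bool dirty;
    } l1d_line_t;

    typedef enum {
        HANDLE_LOCALLY, /* complete inside the L1D                  */
        SEND_GETS,      /* read miss: involve the last-level Cache  */
        SEND_GETM       /* write miss / clean write: involve the LLC */
    } l1d_action_t;

    /* Read hits and dirty-write hits complete inside the L1D; every
     * other case involves the last-level Cache. (Clean write hits are
     * folded into the GetM path, as the description notes later.) */
    l1d_action_t classify_access(const l1d_line_t *line, bool hit, bool is_write)
    {
        if (hit && !is_write)
            return HANDLE_LOCALLY;               /* read hit        */
        if (hit && is_write && line->dirty)
            return HANDLE_LOCALLY;               /* dirty write hit */
        return is_write ? SEND_GETM : SEND_GETS; /* everything else */
    }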
In the specific application process, as shown in Fig. 2, the operations can be divided into four classes according to their complexity, corresponding to figures (a), (b), (c), and (d).
Figure (a) is instruction-handling diagram (1). As the figure shows, the instruction operation completes in two steps, and two cases can be distinguished by instruction class. After an L1D read miss, a GetS request is generated and sent to the last-level Cache. Upon receiving the GetS request, the last-level Cache first checks the directory entry; since the directory state is I or S, the latest data is cached in the last-level Cache, so the last-level Cache reads the data directly and returns it to the requester. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. Upon receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is I, the latest data is cached in the last-level Cache, so the last-level Cache reads the data directly and returns it to the requester.
Figure (b) is instruction-handling diagram (2). As the figure shows, the instruction operation completes in three steps, and two cases can be distinguished by instruction class. After an L1D read miss, a GetS request is generated and sent to the last-level Cache. Upon receiving the GetS request, the last-level Cache first checks the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetS request to the L1D holding the latest copy. Upon receiving the Fwd-GetS request, that L1D sends read-return-data requests to the last-level Cache and the requester simultaneously, and changes its local directory to state S. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. Upon receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is M, the latest data is not in the last-level Cache, so the last-level Cache sends a Fwd-GetM request to the L1D holding the latest copy. Upon receiving the Fwd-GetM request, that L1D sends a read-return-data request to the requester only, and changes its local directory to state I.
Figure (c) is instruction-handling diagram (3). As the figure shows, the instruction operation completes in three steps, and two cases can be distinguished according to whether the instruction hits and whether the data is dirty. After an L1D write miss, a GetM request is generated and sent to the last-level Cache. Upon receiving the GetM request, the last-level Cache first checks the directory entry; since the directory state is S, the last-level Cache sends an Inv-L request to every L1D holding a copy according to the sharer-list information, and at the same time reads the data and returns it to the requester, carrying an Ack signal. When an L1D receives an Inv-L request, it destroys its local data block and sends an Inv-Ack request to the requester. The missing L1D performs the write only after it has received both the data and all invalidation acknowledgments. A GetM request is also generated on a clean write hit in the L1D; since its handling is identical to the write-miss case, it is not discussed further here.
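The directory-side reactions in figures (a) through (c) can be condensed into a short C sketch. This is a simplified model under assumed names (dir_entry_t, with printf placeholders standing in for network messages), and it omits the transient states, such as S^D, that the actual controllers track:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;

    typedef struct {
        dir_state_t state;
        uint32_t    sharers;   /* one bit per DSP core's L1D */
    } dir_entry_t;

    /* GetS: in state M the latest data sits in the owner's L1D, so the
     * request is forwarded (figure (b)); in I or S the last-level Cache
     * answers itself (figure (a)). */
    void handle_gets(dir_entry_t *e, int requester)
    {
        if (e->state == DIR_M)
            printf("Fwd-GetS -> owner; owner replies to LLC and core %d\n", requester);
        else
            printf("data -> core %d\n", requester);
        e->state = DIR_S;
        e->sharers |= 1u << requester;
    }

    /* GetM: in M, forward to the owner, which then invalidates itself; in
     * S, invalidate every sharer and attach the Inv-Ack count to the data
     * reply (figure (c)); in I, reply directly (figure (a)). */
    void handle_getm(dir_entry_t *e, int requester)
    {
        if (e->state == DIR_M)
            printf("Fwd-GetM -> owner; owner replies to core %d only\n", requester);
        else if (e->state == DIR_S)
            printf("Inv-L -> sharers; data + Ack count -> core %d\n", requester);
        else
            printf("data -> core %d\n", requester);
        e->state = DIR_M;
        e->sharers = 1u << requester;
    }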
Figure (d) is the L1D replacement-handling diagram. As the figure shows, the operation completes in two steps, with two cases depending on whether the data is dirty. When a clean line is replaced, the L1D sends a PutS request to the last-level Cache. Upon receiving the PutS request, the last-level Cache updates the sharer-list information; if the updated sharer list is all zero, the directory state must also be changed to I. When the operation completes, the last-level Cache sends a Put-Ack request to the requester. When a dirty line is replaced, the L1D sends a PutM+data request to the last-level Cache. Upon receiving the PutM+data request, the last-level Cache updates the data body and the directory information. When the operation completes, the last-level Cache sends a Put-Ack request to the requester.
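A corresponding sketch of the PutS path in figure (d), again under assumed names: the evicting core is cleared from the sharer list, and the state falls back to I when the last sharer leaves (this also matches the PutS-NotLast/PutS-Last distinction drawn later):

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;
    typedef struct { dir_state_t state; uint32_t sharers; } dir_entry_t;

    /* Clean replacement: drop the evicting core from the sharer list;
     * once the list is all zero, the directory state falls back to I.
     * Both clean and dirty replacements end with a Put-Ack. */
    void handle_puts(dir_entry_t *e, int evictor)
    {
        e->sharers &= ~(1u << evictor);
        if (e->sharers == 0)
            e->state = DIR_I;
        printf("Put-Ack -> core %d\n", evictor);
    }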
In a DSP, the DMA must carry large amounts of data between peripherals and between cores. If no coherence maintenance is performed for the DMA, multi-core data inconsistency inevitably appears; if DMA coherence is instead maintained by a software-hardware cooperative synchronization mechanism, the programmer must monitor the state of the memory space in real time, which poses no small challenge to the programmer. The present invention therefore extends the basic directory protocol so that it also performs coherence maintenance for DMA requests. The DMA accesses the last-level Cache directly through the interconnection network. According to the bank operation performed, DMA requests fall into two classes, corresponding to the cases shown in Fig. 3.
Figure (a) is the DMA read-request handling diagram. After the last-level Cache receives a DMA read request, it checks the directory information and acts according to its content. When the directory is in state I or S, the latest data is in the last-level Cache, so the DMA read request returns the data directly to the DMA after reading it. When the directory is in state M, the last-level Cache does not hold the latest data, so it sends a Fwd-Rd request to the L1D holding the latest copy. Upon receiving the Fwd-Rd request, that L1D sends read-return-data requests to the last-level Cache and the DMA simultaneously, and changes the directory to state S.
Figure (b) is the DMA write-request handling diagram. After the last-level Cache receives a DMA write request, it checks the directory information and acts according to its content. When the directory is in state I, the latest data is in the last-level Cache, so the DMA write request updates the data body and returns an acknowledgment to the DMA when the operation completes. When the directory is in state M, the last-level Cache lacks the latest data, so it sends a Fwd-Wrt request to the L1D holding the latest copy; upon receiving the Fwd-Wrt request, that L1D sends a read-return-data request to the last-level Cache only and destroys its local data block. After the latest data returns, the last-level Cache merges it with the data carried by the DMA write request and then updates the data body and directory information; when the operation completes, the last-level Cache returns an acknowledgment to the DMA. When the directory is in state S, the last-level Cache sends an Inv-DE request to every L1D holding a copy according to the sharer list; each such L1D destroys its local data block and sends an Inv-Ack request to the last-level Cache. Only after receiving all invalidation acknowledgments does the last-level Cache perform the write, returning an acknowledgment to the DMA when the operation completes.
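The three DMA-write cases can be summarized in the following hedged C sketch; __builtin_popcount is a GCC/Clang builtin used here to count sharers, and the return value models the number of Inv-Ack replies the last-level Cache must collect before writing:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;
    typedef struct { dir_state_t state; uint32_t sharers; } dir_entry_t;

    /* Returns the number of Inv-Ack replies still needed before the DMA
     * write can complete. */
    int handle_dma_write(dir_entry_t *e)
    {
        switch (e->state) {
        case DIR_I:   /* LLC already holds the latest data */
            printf("update data body; ack -> DMA\n");
            return 0;
        case DIR_M:   /* fetch the owner's copy, merge, then write */
            printf("Fwd-Wrt -> owner; merge returned data; ack -> DMA\n");
            e->state   = DIR_I;  /* the owner destroys its copy (assumed final state) */
            e->sharers = 0;
            return 0;
        case DIR_S:   /* invalidate every sharer before writing */
            printf("Inv-DE -> %d sharer(s)\n", __builtin_popcount(e->sharers));
            return __builtin_popcount(e->sharers); /* write after the last Inv-Ack */
        }
        return 0;
    }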
Directory operations are atomic, but many conflict situations arise during design and implementation. For example, from the chip-wide global perspective, a conflict occurs when a related directory request arrives before the previous directory request has finished processing. The present invention resolves such conflict situations with a directory-controller mechanism. According to where they reside, directory controllers are of two kinds: L1D directory controllers and last-level-Cache directory controllers.
Tables 2.1, 2.2, and 2.3 below describe the L1D directory controller in detail. As the tables show, besides the three stable states M, S, and I, a data block's directory state can also take many "intermediate" (transient) states. For example, between an L1D read miss and the return of the missing data, the data block accessed by the read request remains in state IS^D. A data block in an "intermediate" state can respond to some related snoop requests; a snoop request that cannot be responded to is simply stalled. This both guarantees data coherence and improves system performance. For example, after a dirty replacement in the L1D, the directory state of the line being replaced becomes MI^A. If, before the corresponding acknowledgment from the last-level Cache arrives, this L1D receives a related Fwd-Rd request, it can respond at once and change the directory state to SI^A when the operation finishes.
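The stall-or-respond behavior of one transient state can be sketched as follows; the state and request encodings are illustrative assumptions, and only the MI^A/Fwd-Rd interaction quoted above is modeled, with everything else stalled:

    #include <stdbool.h>

    /* Transient states quoted above; many more appear in tables 2.1-2.3. */
    typedef enum { ST_M, ST_S, ST_I, ST_IS_D, ST_MI_A, ST_SI_A } l1d_dir_state_t;
    typedef enum { REQ_FWD_RD } snoop_t;

    /* Returns true if the snoop was serviced, false if it must stall. */
    bool snoop_in_transient(l1d_dir_state_t *st, snoop_t req)
    {
        if (*st == ST_MI_A && req == REQ_FWD_RD) {
            /* dirty replacement in flight: reply with data, then downgrade */
            *st = ST_SI_A;
            return true;
        }
        return false;   /* e.g. a block in IS^D has no data yet: stall */
    }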
Table 2.1
Table 2.2
Table 2.3
In the directory protocol designed by the present invention, the L1D write hit on a clean line is somewhat special. Normally, a hitting request has no need to fetch data from the next-level Cache; to reduce the complexity of the directory protocol, however, it is classified in the same class as the L1D write miss.
The invalidation requests received by an L1D are of two kinds: Inv-L requests and Inv-DE requests. The directory protocol designed by the present invention adds coherence maintenance for the DMA and operates in a write-invalidate manner. Therefore, when a DMA write request operates on the last-level Cache, each copy-holding L1D that receives an Inv-DE request must return its acknowledgment to the last-level Cache. An L1D executing a store instruction may likewise trigger the invalidation operation (sending Inv-L requests), but in that case the acknowledgments return to the requesting L1D instead. Since the device to which acknowledgments return differs by invalidation-request type, the present invention distinguishes the two.
An invalidation-acknowledgment (Inv-Ack) request arriving at an L1D is handled according to circumstances. If it is not the last outstanding invalidation acknowledgment, the directory state of the corresponding data block does not change; otherwise, the directory state of the corresponding data block changes from IM^A or SM^A to M.
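A minimal sketch of this acknowledgment counting, under assumed type names:

    #include <stdio.h>

    typedef enum { ST_IM_A, ST_SM_A, ST_M } wr_state_t;

    typedef struct {
        wr_state_t state;
        int        pending_acks;  /* taken from the Ack count in the data reply */
    } miss_entry_t;

    void on_inv_ack(miss_entry_t *m)
    {
        if (--m->pending_acks == 0) {
            m->state = ST_M;      /* IM^A or SM^A -> M: the write may proceed */
            printf("last Inv-Ack received; performing write\n");
        }
        /* otherwise the directory state of the block does not change */
    }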
Two sources return read data to the L1D: the last-level Cache and other L1Ds. Because some read-return-data requests must carry Ack information (recording the number of invalidation acknowledgments still to return), they must be distinguished. As the first row of Table 2.3 shows, read-return-data requests fall into five cases: from an owner L1D without Ack; from an owner L1D with Ack; from the last-level Cache without Ack; from the last-level Cache with Ack equal to 0; and from the last-level Cache with Ack greater than 0.
Tables 3.1, 3.2, and 3.3 below describe the last-level-Cache directory controller in detail. Similar to the L1D directory controller, "intermediate" states also exist in the last-level-Cache directory controller. For example, after a read miss the L1D sends a GetS request to the last-level Cache; if the requested data block is in state M in the last-level Cache, the directory state becomes S^D until the latest data returns. Invalidation-acknowledgment (Inv-Ack) handling is similar to that of the L1D directory controller and is not discussed further here.
Table 3.1
Table 3.2
Table 3.3
L1D replacement is divided into two kinds: clean-line replacement and dirty-line replacement.
A clean-line replacement sends a PutS request to the last-level Cache. In actual operation, multiple cores may share one data block, so PutS requests are divided, by their order of arrival at the last-level Cache, into not-the-last (PutS-NotLast) and the-last (PutS-Last) cases. When a PutS-NotLast request is processed, the data block's directory state does not change; only the corresponding sharer-list information changes. When a PutS-Last request is processed, the data block's directory state changes from S to I, and the corresponding sharer list is cleared.
A dirty-line replacement sends a PutM+data request to the last-level Cache. The directory entry of the accessed data block may have changed before the request is processed, so the handling differs. If the directory state of the data block accessed by the dirty-replacement request is M and the L1D indicated by the sharer list happens to be the L1D performing this dirty replacement, the request is called a PutM+data from Owner request; in this case, the dirty replacement data updates the last-level-Cache data body and an acknowledgment is returned. Otherwise, the dirty-replacement request is called a PutM+data from Non-Owner request; in this case, only an acknowledgment need be returned.
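The Owner/Non-Owner discrimination can be sketched as follows; per the description, only the owner's writeback updates the data body, a stale Non-Owner writeback is dropped, and both receive an acknowledgment:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DIR_I, DIR_S, DIR_M } dir_state_t;
    typedef struct { dir_state_t state; uint32_t sharers; } dir_entry_t;

    void handle_putm_data(dir_entry_t *e, int evictor)
    {
        if (e->state == DIR_M && e->sharers == (1u << evictor)) {
            /* PutM+data from Owner: the writeback carries the latest data */
            printf("update data body with writeback from core %d\n", evictor);
            e->state   = DIR_I;
            e->sharers = 0;
        } else {
            /* PutM+data from Non-Owner: the directory entry changed before
             * this request was processed, so the carried data is stale */
            printf("drop stale writeback from core %d\n", evictor);
        }
        printf("Put-Ack -> core %d\n", evictor);   /* acknowledged either way */
    }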
The present invention designs the directory coherence mechanism around the last-level Cache; the directory structure of one set in a last-level-Cache directory bank is shown in Fig. 4. As the figure shows, the last-level Cache uses an 8-way set-associative mapping and allocates one directory entry per way. A directory entry consists of two parts: a directory state and a sharer list. The directory state indicates whether the cached data block holds the latest data in the last-level Cache and whether it is dirty; the sharer list indicates the copy situation of the cached data block in the first-level storage. Combining the directory state and the sharer-list information gives a clear picture of the data block's situation on chip, which facilitates performing coherence maintenance on it.
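As a rough data-structure rendering of Fig. 4, one set could be modeled as below; the 16-core sharer width and the field widths are illustrative assumptions, since the patent fixes only the 8-way associativity and the state/sharer-list split:

    #include <stdint.h>

    #define NUM_CORES 16   /* assumed core count; the patent only says N  */
    #define NUM_WAYS   8   /* fixed by the 8-way set-associative mapping  */

    typedef struct {
        uint8_t  state;    /* directory state: M / S / I (plus transients) */
        uint16_t sharers;  /* bit i set: core i's L1D holds a copy         */
    } dir_entry_t;

    /* One set of the last-level Cache carries one directory entry per way. */
    typedef struct {
        dir_entry_t way[NUM_WAYS];
    } dir_set_t;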
The realization of the directory mechanism varies with the pipeline implementation; by way of example, the present invention briefly describes the pipeline realizations in the L1D and the last-level Cache.
Fig. 5 shows the overall structure of the L1D pipeline in the specific application example. The pipeline consists of seven stages: DC1, DC2, EX1, EX2, EX3, EX4, and EX5. After the L1D receives a load or store instruction, it first performs a two-stage pipelined decode (DC1 and DC2) that determines the instruction's operation type and the function to be realized. Address calculation follows instruction decoding (EX1). The function realization then proceeds: the present invention designs an L1D Cache memory-access pipeline in which reads execute in 3 beats and writes in 2 beats. The read pipeline occupies positions EX2, EX3, and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss handling, and access output respectively; the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss handling respectively. As part of the scalar memory-access main pipeline, the L1D Cache pipeline is also controlled by the core-wide global stall signal (Stall) and the pipeline flush signal.
Fig. 6 shows the overall structure of the last-level-Cache pipeline in the specific application example. As the figure shows, requests entering the last-level Cache can take two different paths through the pipeline. A read-return request from the L1D, or an invalidation acknowledgment returned by the L1D, is not buffered in the input buffer but is transferred by a bypass directly to the Tag_Judge stage of the pipeline; this is the first path. A request accessing the DDR memory space must first be buffered in the input buffer and then undergo pipelined processing starting from the first stage; this is the second path. The pipeline consists of five stages: Req_Arb, Tag_Wait, Tag_Judge, Data_Acc, and Data_Dec.
The function realized by each pipeline stage is described in detail below.
Pipeline first stage (Req_Arb): at this stage, requests and Flush requests undergo round-robin arbitration, and the winning request is sent to the next stage. At the same time, the valid bit, dirty bit, and Tag information of the requested data block are read.
Pipeline second stage (Tag_Wait): at this stage, the request is only checked for whether it is a directory request, and the directory information is read.
Pipeline third stage (Tag_Judge): first judge whether the request hits; on a miss, further judge whether it is address-related to an entry in the MBUF. A related miss request is sent to the MBUF; an unrelated miss request is sent to both the MBUF and the OBUF. Hitting requests are processed according to whether they are directory requests. A non-directory request generates a data-body access enable. For a directory request, the directory entry is checked, and the handling differs with the directory state and sharer list. The handling of directory requests falls into three classes, modeled in the sketch after the stage descriptions: the first class operates directly and generates the data-body access enable; the second class must wait for L1D data to return, because the latest data is not in the last-level Cache, generating the access enable only after the data arrives; the third class must wait for Inv-Ack requests, generating the access enable only after all invalidation acknowledgments have arrived.
Pipeline fourth stage (Data_Acc): first judge the request's data-body handling class. For a read operation, the requested data block is latched for one beat after being read from the data body; for a write operation, the write data is first encoded and the data body is then updated. Finally, a request performing a data-body read sends the read data to the next stage.
Pipeline fifth stage (Data_Dec): the request data read by the previous stage is decoded, and the read-return-data request is sent to the ring arbitration processing module for handling.
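The three-class dispatch performed in Tag_Judge (referenced above) can be sketched as a small decision function; the predicate names are illustrative assumptions:

    #include <stdbool.h>

    typedef enum {
        DISPATCH_NOW,    /* class 1: generate the data-body access enable now  */
        WAIT_L1D_DATA,   /* class 2: latest data must first return from an L1D */
        WAIT_INV_ACKS    /* class 3: wait until every Inv-Ack has arrived      */
    } tag_judge_class_t;

    tag_judge_class_t tag_judge_dispatch(bool llc_has_latest, int invalidations_needed)
    {
        if (!llc_has_latest)
            return WAIT_L1D_DATA;
        if (invalidations_needed > 0)
            return WAIT_INV_ACKS;
        return DISPATCH_NOW;
    }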
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention shall also be regarded as within the protection scope of the present invention.

Claims (8)

1. A multi-core directory coherence device for a GPDSP architecture, characterized by comprising:
a kernel, containing a DMA and an L1D, where L1D is the level-one data Cache; the DMA completes data transfers between peripherals and between cores; the L1D contains two parallel processing units, Normal Deal and Monitor Deal; the Normal Deal processing unit completes the processing of load and store instructions, and the Monitor Deal processing unit responds to snoop requests arriving at any time, its processing unaffected by the Normal Deal processing unit;
an on-chip last-level Cache, distributed across and connected to the on-chip interconnection network;
an off-chip DDR memory, whose data is cached in the L1D and the on-chip last-level Cache;
an on-chip interconnection network that receives network requests, decodes each request after receiving it, and forwards the request to the appropriate destination once the target node and target device have been decoded;
wherein the on-chip last-level Cache is divided into several banks, each bank consisting of an input buffer unit IBUF, a pipeline unit PipeLine, an output buffer unit OBUF, and a return-ring processing logic unit Rtn NAC; the input buffer unit caches requests entering the last-level Cache from the network-on-chip; the pipeline unit performs pipelined processing of requests from the input buffer that access the DDR memory space; the output buffer unit caches requests from the last-level Cache to the DDR; and the return-ring processing logic unit arbitrates among the various types of requests entering the network-on-chip.
2. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized by further comprising an MSI directory protocol unit for maintaining coherence of the requests issued by the L1D; the MSI directory protocol unit is built on three directory states, M, S, and I; state M indicates that the data is exclusively owned by some DSP core and is dirty; state S indicates that the data is shared by one or more DSP cores and is clean; state I indicates that no DSP core holds a copy of the data.
3. The multi-core directory coherence device for a GPDSP architecture according to claim 2, characterized by further comprising directory controllers, which judge the correctness of the scheme at the protocol level and are used to complete the processing of a single request under different directory states, the conflict handling of multiple related requests, and the responses of data blocks in directory "intermediate" states to related requests; the directory controllers are of two kinds, one placed in the L1D and the other placed in the on-chip last-level Cache, and directory operations are performed in both the L1D and the on-chip last-level Cache.
4. The multi-core directory coherence device for a GPDSP architecture according to claim 2, characterized in that the on-chip last-level Cache stores a full-directory structure that allocates a directory entry for every data block cached in the on-chip last-level Cache; the directory entry consists of two parts, a directory state and a sharer list, where the sharer list allocates one bit per DSP core to indicate whether the data has a copy in the corresponding DSP core.
5. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized in that the L1D uses a pipeline structure to complete the pipelined flow of instruction decoding, address calculation, tag and status-bit reading, hit judgment, data-body access, and data return.
6. The multi-core directory coherence device for a GPDSP architecture according to claim 5, characterized in that the pipeline stages of the L1D are DC1, DC2, EX1, EX2, EX3, EX4, and EX5; after the L1D receives a load or store instruction, it first undergoes a two-stage pipelined decode, DC1 and DC2, which determines the instruction's operation type and the function to be realized; address calculation, namely EX1, is performed after instruction decoding completes, and the function realization follows; the L1D Cache memory-access pipeline then completes reads in 3 beats and writes in 2 beats, where the read pipeline occupies positions EX2, EX3, and EX4 of the SMAC memory-access main pipeline, performing hit/miss judgment, access/miss handling, and access output respectively, and the write pipeline occupies positions EX2 and EX3, performing miss judgment and access/miss handling respectively.
7. The multi-core directory coherence device for a GPDSP architecture according to claim 1, characterized in that the on-chip last-level Cache uses a pipeline structure to realize the pipelined flow of tag and status-bit reading, directory-entry reading, hit judgment, snoop handling, data-body access, and data return.
8. The multi-core directory coherence device for a GPDSP architecture according to claim 7, characterized in that the pipeline structure of the on-chip last-level Cache includes:
pipeline first stage Req_Arb: requests and Flush requests undergo round-robin arbitration, and the winning request is sent to the next stage; at the same time, the valid bit, dirty bit, and Tag information of the requested data block are read;
pipeline second stage Tag_Wait: judge whether the request is a directory request, and read the directory information;
pipeline third stage Tag_Judge: first judge whether the request hits; on a miss, further judge whether it is address-related to an entry in the MBUF; a related miss request is sent to the MBUF; an unrelated miss request is sent to both the MBUF and the OBUF; hitting requests are processed according to whether they are directory requests; a non-directory request generates a data-body access enable; for a directory request, the directory entry is checked, and the handling differs with the directory state and sharer list; the handling of directory requests falls into three classes: the first class operates directly and generates the data-body access enable; the second class waits for L1D data to return, generating the access enable only after the data arrives; the third class waits for Inv-Ack requests, generating the access enable only after all invalidation acknowledgments have arrived;
pipeline fourth stage Data_Acc: first judge the request's data-body handling class; for a read operation, the requested data block is latched for one beat after being read from the data body; for a write operation, the write data is first encoded and the data body is then updated; finally, a request performing a data-body read sends the read data to the next stage;
pipeline fifth stage Data_Dec: the request data read by the previous stage is decoded, and the read-return-data request is sent to the ring arbitration processing module for handling.
CN201610503703.5A 2016-06-30 2016-06-30 Multi-core directory coherence device for a GPDSP architecture Active CN106201939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610503703.5A CN106201939B (en) 2016-06-30 2016-06-30 Multi-core directory coherence device for a GPDSP architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610503703.5A CN106201939B (en) 2016-06-30 2016-06-30 Multi-core directory coherence device for a GPDSP architecture

Publications (2)

Publication Number Publication Date
CN106201939A CN106201939A (en) 2016-12-07
CN106201939B true CN106201939B (en) 2019-04-05

Family

ID=57463707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610503703.5A Active CN106201939B (en) 2016-06-30 2016-06-30 Multi-core directory coherence device for a GPDSP architecture

Country Status (1)

Country Link
CN (1) CN106201939B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117396A (en) * 2018-08-30 2019-01-01 山东经安纬固消防科技有限公司 memory access method and system
CN110704343B (en) * 2019-09-10 2021-01-05 无锡江南计算技术研究所 Data transmission method and device for memory access and on-chip communication of many-core processor
CN113435153B (en) * 2021-06-04 2022-07-22 上海天数智芯半导体有限公司 Method for designing digital circuit interconnected by GPU (graphics processing Unit) cache subsystems
CN116028418B (en) * 2023-02-13 2023-06-20 中国人民解放军国防科技大学 GPDSP-based extensible multi-core processor, acceleration card and computer

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279428A (en) * 2013-05-08 2013-09-04 中国人民解放军国防科学技术大学 Explicit multi-core Cache consistency active management method facing flow application
CN103714039A (en) * 2013-12-25 2014-04-09 中国人民解放军国防科学技术大学 Universal computing digital signal processor
CN104679689A (en) * 2015-01-22 2015-06-03 中国人民解放军国防科学技术大学 Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105389277A (en) * 2015-10-29 2016-03-09 中国人民解放军国防科学技术大学 Scientific computation-oriented high performance DMA (Direct Memory Access) part in GPDSP (General-Purpose Digital Signal Processor)
CN105718242A (en) * 2016-01-15 2016-06-29 中国人民解放军国防科学技术大学 Processing method and system for supporting software and hardware data consistency in multi-core DSP (Digital Signal Processing)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279428A (en) * 2013-05-08 2013-09-04 中国人民解放军国防科学技术大学 Explicit multi-core Cache consistency active management method facing flow application
CN103714039A (en) * 2013-12-25 2014-04-09 中国人民解放军国防科学技术大学 Universal computing digital signal processor
CN104679689A (en) * 2015-01-22 2015-06-03 中国人民解放军国防科学技术大学 Multi-core DMA (direct memory access) subsection data transmission method used for GPDSP (general purpose digital signal processor) and adopting slave counting
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105389277A (en) * 2015-10-29 2016-03-09 中国人民解放军国防科学技术大学 Scientific computation-oriented high performance DMA (Direct Memory Access) part in GPDSP (General-Purpose Digital Signal Processor)
CN105718242A (en) * 2016-01-15 2016-06-29 中国人民解放军国防科学技术大学 Processing method and system for supporting software and hardware data consistency in multi-core DSP (Digital Signal Processing)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"X-DSP一级数据Cache的设计与实现";李明;《中国优秀硕士学位论文全文数据库信息科技辑》;20141115;第I137-29页,正文第1-7页第1.1-1.2节,第9-23页第2.1-2.6节

Also Published As

Publication number Publication date
CN106201939A (en) 2016-12-07

Similar Documents

Publication Publication Date Title
US6631448B2 (en) Cache coherence unit for interconnecting multiprocessor nodes having pipelined snoopy protocol
US6636949B2 (en) System for handling coherence protocol races in a scalable shared memory system based on chip multiprocessing
US6697919B2 (en) System and method for limited fanout daisy chaining of cache invalidation requests in a shared-memory multiprocessor system
US6640287B2 (en) Scalable multiprocessor system and cache coherence method incorporating invalid-to-dirty requests
JP3927556B2 (en) Multiprocessor data processing system, method for handling translation index buffer invalidation instruction (TLBI), and processor
US8180981B2 (en) Cache coherent support for flash in a memory hierarchy
US6738868B2 (en) System for minimizing directory information in scalable multiprocessor systems with logically independent input/output nodes
US9740617B2 (en) Hardware apparatuses and methods to control cache line coherence
CN106201939B (en) Multi-core directory coherence device for a GPDSP architecture
US9361233B2 (en) Method and apparatus for shared line unified cache
US20170185515A1 (en) Cpu remote snoop filtering mechanism for field programmable gate array
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
Chiou et al. StarT-NG: Delivering seamless parallel computing
EP1153349A1 (en) Non-uniform memory access (numa) data processing system that speculatively forwards a read request to a remote processing node
CN110647404A (en) System, apparatus and method for barrier synchronization in a multithreaded processor
US10073782B2 (en) Memory unit for data memory references of multi-threaded processor with interleaved inter-thread pipeline in emulated shared memory architectures
Thakkar et al. The balance multiprocessor system
US20060224840A1 (en) Method and apparatus for filtering snoop requests using a scoreboard
CN109661656A (en) Method and apparatus for the intelligent storage operation using the request of condition ownership
CN103019655B (en) Towards memory copying accelerated method and the device of multi-core microprocessor
US20070073977A1 (en) Early global observation point for a uniprocessor system
WO2017172220A1 (en) Method, system, and apparatus for a coherency task list to minimize cache snooping between cpu and fpga
Gao et al. System architecture of Godson-3 multi-core processors
US9436605B2 (en) Cache coherency apparatus and method minimizing memory writeback operations
US11163682B2 (en) Systems, methods, and apparatuses for distributed consistency memory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant