CN114218140A - Method and system for implementing arbitration mechanism in L2 - Google Patents

Method and system for implementing arbitration mechanism in L2

Info

Publication number
CN114218140A
Authority
CN
China
Prior art keywords
data
core
request
efq
crq
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111549385.3A
Other languages
Chinese (zh)
Inventor
李长林
刘磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Saifang Technology Co ltd
Original Assignee
Guangdong Saifang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Saifang Technology Co ltd filed Critical Guangdong Saifang Technology Co ltd
Priority to CN202111549385.3A
Publication of CN114218140A
Legal status (current): Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to the technical field of CPUs (central processing units), and in particular to a method and a system for implementing an arbitration mechanism in L2, comprising the following steps: receiving arbitration requests from the queues carrying weight information, namely crq, cwq, esq, efq and orq; selecting one of the arbitration requests received in S1 to enter the pipeline according to the weight information of the requests; judging the weight information of the requests, and if two or more requests are tied at the highest weight, arbitrating according to the order efq > esq > orq > crq/cwq; if arbitration requests of equal weight remain tied, resolving their priority by round-robin polling. The invention can avoid the situation in which the queue of one core is full while the queues of other cores are empty because that core has received no responses to too many outstanding requests, thereby keeping the queues balanced, preventing the frequent scenario in which a core cannot continue to send requests to L2, and effectively improving the overall performance of the CPU.

Description

Method and system for implementing arbitration mechanism in L2
Technical Field
The invention relates to the technical field of CPUs (central processing units), in particular to a method and a system for realizing an arbitration mechanism in L2.
Background
L2 must arbitrate among multiple requests for entry into its pipeline, including read and write requests from multiple cores, snoop requests from the outside, and backfill requests from inside L2. These requests are highly heterogeneous: they include a large number of read and write requests issued by each core as well as a number of requests that require a timely response.
In the prior art, requests that need the fastest response from the core or the lower-level memory receive no special treatment in L2 arbitration, or a round-robin/random scheme is used, so such requests cannot be answered in the most timely manner, which affects the overall performance of the CPU.
Likewise, requests sent by a core cannot obtain a higher priority according to the number of that core's outstanding (not yet responded) requests. A core whose many requests go unanswered quickly fills the queue that buffers its requests, and can then no longer send requests to L2, which affects the overall performance of the CPU.
Therefore, to ensure that requests requiring a timely response are answered faster, and that requests from a core with many outstanding requests are answered faster as well, that is, given a higher priority in L2 arbitration, each request is assigned a weight. The arbitration module determines the priority among requests from their source and their weight, so that requests requiring a timely response obtain the highest priority and, among the cores, a core with too many outstanding requests obtains a higher priority, effectively improving the overall performance of the CPU.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses a method and a system for implementing an arbitration mechanism in L2, which are used for solving the problems.
The invention is realized by the following technical scheme:
In a first aspect, the present invention provides a method for implementing the arbitration mechanism in L2, including the following steps:
S1, receiving arbitration requests from the queues carrying weight information, namely crq, cwq, esq, efq and orq;
S2, selecting one of the arbitration requests from S1 to enter the pipeline according to the weight information of the requests;
S3, judging the weight information of the requests; if two or more requests are tied at the highest weight, arbitrating according to the order efq > esq > orq > crq/cwq;
S4, if arbitration requests of equal weight remain tied, resolving their priority by round-robin polling.
Furthermore, in the method, when crq and cwq have the same weight, their priority is determined by round-robin polling: if crq obtained arbitration last time, then cwq has higher priority than crq next time; if cwq obtained arbitration last time, then crq has higher priority than cwq next time.
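The selection and tie-breaking rules above can be summarized in a small software model. The following C sketch is offered only as an illustration under stated assumptions: the five request sources and the fixed order efq > esq > orq > crq/cwq follow the text, while the data types, the single weight value per queue and the last_core_winner flag used for the crq/cwq round-robin are assumptions, not the patent's actual hardware implementation.

/* Minimal software sketch of the weighted arbitration (steps S1-S4 above).
 * Types and the last_core_winner flag are illustrative assumptions. */
#include <stdbool.h>

typedef enum { SRC_EFQ, SRC_ESQ, SRC_ORQ, SRC_CRQ, SRC_CWQ, SRC_COUNT } req_src_t;

typedef struct {
    bool valid;   /* the queue presents a request this cycle     */
    int  weight;  /* weight carried with the arbitration request */
} arb_req_t;

/* Fixed tie-break order when the highest weights are equal:
 * efq > esq > orq > crq/cwq (crq vs. cwq resolved by round-robin). */
static const req_src_t tie_order[SRC_COUNT] = { SRC_EFQ, SRC_ESQ, SRC_ORQ, SRC_CRQ, SRC_CWQ };

static req_src_t arbitrate(const arb_req_t req[SRC_COUNT], req_src_t *last_core_winner)
{
    /* S2/S3: find the highest weight among the valid requests. */
    int best_w = -1;
    for (int s = 0; s < SRC_COUNT; s++)
        if (req[s].valid && req[s].weight > best_w)
            best_w = req[s].weight;
    if (best_w < 0)
        return SRC_COUNT;                      /* nothing to arbitrate */

    bool crq_tied = req[SRC_CRQ].valid && req[SRC_CRQ].weight == best_w;
    bool cwq_tied = req[SRC_CWQ].valid && req[SRC_CWQ].weight == best_w;

    /* S3/S4: walk the fixed order; a crq/cwq tie falls back to round-robin. */
    for (int i = 0; i < SRC_COUNT; i++) {
        req_src_t s = tie_order[i];
        if (!(req[s].valid && req[s].weight == best_w))
            continue;
        if ((s == SRC_CRQ || s == SRC_CWQ) && crq_tied && cwq_tied)
            s = (*last_core_winner == SRC_CRQ) ? SRC_CWQ : SRC_CRQ;  /* last winner loses */
        if (s == SRC_CRQ || s == SRC_CWQ)
            *last_core_winner = s;
        return s;
    }
    return SRC_COUNT;
}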
Furthermore, in the method, when weights are assigned, if the base level of efq is 2 or 3, the weights from 2 to 8 or from 3 to 8 are apportioned according to the number of valid efq entries.
Furthermore, in the method, when weights are assigned, if the base level of orq is 1, the weights from 1 to 8 are apportioned according to the number of valid orq entries.
Furthermore, in the method, when weights are assigned, if the base level of crq/cwq is 0, the weights from 0 to 8 are apportioned according to the number of valid crq/cwq entries.
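One plausible reading of the base-level rules above is that a queue's weight starts at its base level with one valid entry and grows by one per additional valid entry, saturating at 8. The formula in the C sketch below is an assumption used for illustration only (the patent does not give an explicit formula), but it reproduces the efq example of embodiment 1: base level 3 with 4 valid entries yields weight 6.

/* Assumed weight mapping: weight = base_level + (valid_entries - 1), capped at 8. */
#include <assert.h>

static int queue_weight(int base_level, int valid_entries)
{
    const int max_weight = 8;
    assert(valid_entries >= 1);  /* an empty queue raises no arbitration request */
    int w = base_level + (valid_entries - 1);
    return (w > max_weight) ? max_weight : w;
}

/* Example: queue_weight(3, 4) == 6 for efq, queue_weight(1, 4) == 4 for orq,
 * so with four valid entries each, the efq request would win arbitration. */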
Further, in the method, the operation of L2 includes the following steps:
T1, receiving requests from the cores and the external extension, receiving requests from the CRQ, CWQ, EPQ, ORQ, EFQ and CPQ, and selecting one of them to enter the pipeline;
T2, generating the corresponding allocation scheme according to the type of the request and the TAG information and MESI state information currently held in L2;
T3, issuing a read request or a write request to the downstream memory through L2, the downstream memory returning the loaded data to the EFQ;
T4, writing the loaded data from the EFQ into L2$ via the L2 pipeline while returning the data to the requesting core;
wherein L2 receives requests from the cores and the external extension as follows: a read request from a core is placed in the CRQ; a write request from a core is placed in the CWQ; a probe request from the outside is placed in the EPQ.
Further, in the method, in the pipeline, according to the type of the request and the TAG information and MESI state information currently held in L2, it is determined: whether L2 can be written directly; whether the return data can be sent directly to the requester; whether data or permission must be reloaded from the downstream memory; whether an evict needs to be generated; whether data must be written to the downstream memory; and whether a core needs to be probed; if data or permission must be loaded from the downstream memory, an ORQ entry is allocated; if data must be written to the downstream memory, a WRQ entry is allocated; if a core needs to be probed, a CPQ entry is allocated, and the probe requests that L2 sends to the cores are all completed through the CPQ.
Furthermore, in the method, when L2 sends a read request to the downstream memory, the read request is issued through the ORQ, which reads the data from the downstream memory and takes the permission corresponding to the data; when L2 sends a write request to the downstream memory, the data is written from L2 to the next-level memory through the WRQ.
In a second aspect, the present invention provides a system for implementing the arbitration mechanism in L2, the system being configured to implement the method for implementing the arbitration mechanism in L2 described in the first aspect, and comprising a probe module, an evict module, a TAG RAM, a DATA RAM, a CRQ, a CWQ, an EPQ, an ORQ, a WRQ, an EFQ and a CPQ.
Furthermore, the probe module is used for snooping: it probes dirty data out of a core or modifies the MESI state information in a core in order to obtain the E permission;
the evict module is used for keeping the data stored in the cache relatively new, an evict being generated when data needs to be replaced in the cache;
the TAG RAM is used for recording the addr of each cacheline and the MESI state information of each cacheline in L2 and in all the cores under L2;
the DATA RAM is used for recording the cacheline data;
the CRQ is a queue for receiving and storing read requests from the cores;
the CWQ is a queue for receiving write requests from the cores;
the EPQ is a queue for receiving probe requests from the outside;
the ORQ is applied for when a request going through the L2 pipeline finds that the cacheline does not exist in the cache or that the access permission of the cacheline in the cache is insufficient, data being loaded from the next-level memory through the ORQ and the corresponding permission taken;
the WRQ is applied for when L2 needs to write a cacheline to the next-level memory, the data being written into the next-level memory through the WRQ;
the EFQ is used for backfilling reload data: when reload data is returned to L2, it is first backfilled into the EFQ and then written into L2$ through the L2 pipeline while the data is returned to the requesting module;
the CPQ is used for storing probe requests to the corresponding cores: a probe request is first stored in the CPQ, and the CPQ then sends the probe request to the corresponding core.
The invention has the beneficial effects that:
the invention needs to obtain the fastest response in the core or the lower-level memory and effectively improves the overall performance of the CPU.
The invention can avoid the situation that the queue of a certain core is full and other core queues are empty because a certain core does not receive response due to too many requests, thereby ensuring the balance of the queues, avoiding the frequent occurrence of the scene that the certain core cannot continuously send request to the L2, and effectively improving the overall performance of the CPU.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a basic block diagram of the L2 of an embodiment of the present invention;
FIG. 2 is a flow chart of the method for implementing the arbitration mechanism in L2.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 2, the present embodiment provides a method for implementing the arbitration mechanism in L2, in which the arbitration flow is as follows:
arbitration requests come from crq, cwq, esq, efq and orq, each carrying weight information;
the arbitration module selects one request to enter the pipeline according to the weight information of the requests;
if two or more requests are tied at the highest weight, the order efq > esq > orq > crq/cwq is used;
when crq and cwq have the same weight, their priority is round-robin: if crq obtained arbitration last time, cwq has higher priority than crq next time; likewise, if cwq obtained arbitration last time, crq has higher priority than cwq next time.
In this embodiment, the basic principle of weight assignment is as follows:
in this embodiment efq, the base level is 3, and the weights of 3-8 are separated according to the number of efq entry items.
As a preferred example of this embodiment, if 6 efq entries are selected, and 1 efq entry is valid, the weight is 3.
As a preference of this embodiment, if 2 efq entries are valid, the weight is 4.
As a preference of this embodiment, if 3 efq entries are valid, the weight is 5.
As a preference of this embodiment, if 4 efq entries are valid, the weight is 6.
As a preference of this embodiment, if 5 efq entries are valid, the weight is 7.
As a preference of this embodiment, if 6 efq entries are valid, the weight is 8.
In a further implementation of this embodiment, the base level of esq is 2, and the rest of the weight assignment is substantially the same as for efq.
In a further implementation of this embodiment, the base level of orq is 1, and the rest of the weight assignment is substantially the same as for efq.
In a further implementation, the base level of crq/cwq is 0, and the rest of the weight assignment is substantially the same as for efq.
Absent the scheme of this embodiment, requests that need the fastest response from the core or the lower-level memory receive no special treatment in L2 arbitration, or a round-robin/random scheme is used, so such requests cannot obtain the most timely response, which affects the overall performance of the CPU.
Likewise, requests sent by a core cannot obtain a higher priority according to the number of that core's outstanding requests; a core whose many requests go unanswered quickly fills the queue that buffers its requests and can no longer send requests to L2, which affects the overall performance of the CPU.
In order to keep the queues in relative balance and avoid the situation in which one queue is frequently full (blocking new requests) while other queues are relatively empty, this embodiment introduces a weight attribute for each request: the higher the weight, the higher its priority in arbitration.
Example 2
In other aspects, the present embodiment provides an L2 operation mode, a block diagram of which is shown in fig. 1, and basic functions of which are described as follows:
Receiving requests from the cores and the external extension, including:
receiving a read request from a core and placing it in the CRQ (core read queue);
receiving a write request from a core and placing it in the CWQ (core write queue);
receiving a probe request from the outside and placing it in the EPQ (extended probe queue);
the arbitration module receives the requests from the CRQ, CWQ, EPQ, ORQ, EFQ and CPQ and selects one of them to enter the pipeline.
In the pipeline, according to the type of the request and the TAG information and MESI state information currently held in L2, it is determined:
whether L2 can be written directly; whether the return data can be sent directly to the requester; whether data or permission must be reloaded from the downstream memory; whether an evict needs to be generated; whether data must be written to the downstream memory; whether a core needs to be probed; and so on.
In this embodiment, if it is determined that data or permission must be reloaded from the downstream memory, an ORQ (outgoing request queue) entry is allocated.
In this embodiment, if it is determined that data must be written to the downstream memory, a WRQ (write request queue) entry is allocated.
In this embodiment, if it is determined that a core needs to be probed, a CPQ (core probe queue) entry is allocated; the probe requests that L2 sends to the cores are all completed through the CPQ.
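The allocation decisions of this embodiment can be pictured as a single decision function that maps the request type and the TAG/MESI lookup result to the queues that must be allocated. The C sketch below is a simplified assumption: the request types, the extra inputs (tag_hit, some_core_has_line, evict_victim_dirty) and the individual rules are illustrative only, and a real L2 pipeline would consider more cases; the point is merely that the lookup result selects which of ORQ, WRQ and CPQ to allocate.

/* Simplified sketch of the pipeline decision; rules and inputs are assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;
typedef enum { REQ_READ, REQ_WRITE, REQ_PROBE } req_type_t;

typedef struct {
    bool need_orq;   /* reload data or permission from the downstream memory */
    bool need_wrq;   /* write data to the downstream memory                  */
    bool need_cpq;   /* probe a core that may hold newer data or E/M rights  */
} decision_t;

static decision_t decide(req_type_t type, bool tag_hit, mesi_t l2_state,
                         bool some_core_has_line, bool evict_victim_dirty)
{
    decision_t d = { false, false, false };

    if (type == REQ_READ) {
        if (!tag_hit || l2_state == MESI_I)
            d.need_orq = true;                 /* miss: reload from downstream     */
        else if (some_core_has_line)
            d.need_cpq = true;                 /* another core may hold newer data */
        if (!tag_hit && evict_victim_dirty)
            d.need_wrq = true;                 /* the evict produces a write below */
    } else if (type == REQ_WRITE) {
        if (!tag_hit)
            d.need_orq = true;                 /* need the line/permission first   */
    } else {                                   /* external probe                   */
        if (some_core_has_line)
            d.need_cpq = true;                 /* forward the probe to the core    */
        if (l2_state == MESI_M)
            d.need_wrq = true;                 /* dirty data must go downstream    */
    }
    return d;
}

int main(void)
{
    /* a core read that misses in L2 and victimizes a dirty line */
    decision_t d = decide(REQ_READ, false, MESI_I, false, true);
    printf("orq=%d wrq=%d cpq=%d\n", d.need_orq, d.need_wrq, d.need_cpq);
    return 0;
}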
L2 issues a request to the downstream memory:
A read request is issued through the ORQ, reading the data from the downstream memory and taking the permission corresponding to the data.
A write request writes the data from L2 to the next-level memory through the WRQ.
The downstream memory returns the loaded data to the EFQ (extended fill queue).
Then, from the EFQ, the loaded data is written into L2$ through the L2 pipeline, and the data is also returned to the requesting core.
Example 3
This embodiment describes the protocol used to maintain data consistency in the memory system, which indicates what permission the current core holds for a cacheline, specifically as follows:
M, Modified: the core has modified the cacheline and the cacheline exists only in this cache; if another cache needs to access the cacheline, the latest dirty data and the corresponding permission must be taken by means of a probe.
E, Exclusive: the cacheline exists only in this cache; if another cache needs to access the cacheline, the corresponding data and permission must be taken by means of a probe.
S, Shared: the cacheline exists in this cache and possibly in other caches; if another cache needs to access the cacheline and take it to the E/M state, the corresponding data and permission must be taken by means of a probe.
I, Invalid: the cacheline is not present in the cache.
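The question L2 keeps asking against these states, namely whether another holder of the line must be probed before a request can be served, can be written as a small helper. The C sketch below is a minimal illustration; the state names mirror the text, while the function and its arguments are assumptions added for clarity.

/* Sketch only: does a line held elsewhere in 'holder_state' require a probe? */
#include <stdbool.h>

typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_state_t;

static bool probe_needed(mesi_state_t holder_state, bool requester_wants_write)
{
    switch (holder_state) {
    case MESI_MODIFIED:  return true;                   /* latest dirty data is in that core      */
    case MESI_EXCLUSIVE: return true;                   /* data and permission must be taken back */
    case MESI_SHARED:    return requester_wants_write;  /* only an upgrade to E/M needs a probe   */
    case MESI_INVALID:
    default:             return false;                  /* the line is not in that cache          */
    }
}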
Example 4
This embodiment describes the modules of the system in which the arbitration mechanism in L2 is implemented, which specifically include:
Probe: snooping. In order to maintain data consistency in a multi-core environment, a probe is used to take dirty data out of a core, or to modify the MESI state information in a core in order to take the E permission.
Evict: because the capacity of the cache is limited, an evict is generated when data needs to be replaced in the cache, in order to keep the data stored in the cache relatively new.
TAG RAM: records the addr of each cacheline and the MESI state information of each cacheline in L2 and in all the cores under L2.
DATA RAM: records the cacheline data.
CRQ (core read queue): the queue that receives and stores read requests from the cores.
CWQ (core write queue): the queue that receives write requests from the cores.
EPQ (extended probe queue): the queue that receives probe requests from the outside.
ORQ (outgoing request queue): when a request going through the L2 pipeline finds that the cacheline does not exist in the cache or that the access permission of the cacheline in the cache is insufficient, an ORQ entry is applied for, and data is loaded from the next-level memory through the ORQ and the corresponding permission is taken.
WRQ (write request queue): when L2 needs to write a cacheline to the next-level memory, a WRQ entry is applied for, and the data is written into the next-level memory through the WRQ.
EFQ (extended fill queue): when reload data is backfilled into L2, the data is first backfilled into the EFQ and then written into L2$ through the L2 pipeline while the data is returned to the requesting module.
CPQ (core probe queue): since one L2 serves multiple cores, the latest data required by a requesting core may reside in another core, or a core may need to take the E/M permission; in these cases the corresponding core must be probed, so the probe request is first stored in the CPQ, and the CPQ then sends the probe request to the corresponding core.
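Taken together, the modules of this embodiment can be pictured with the following data layout. This is only a structural sketch: the entry fields, the fixed depth and the valid_count member are assumptions added for illustration (valid_count is what the weight assignment of embodiment 1 would consume), while the set of queues and the two RAMs follow the text.

/* Structural sketch of the L2 modules; field layout and depth are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define L2_QUEUE_DEPTH 6          /* assumed depth, e.g. the 6-entry efq of embodiment 1 */

typedef struct {
    bool     valid;
    uint64_t addr;                /* cacheline address carried by the request */
    uint8_t  source;              /* issuing core or external port            */
} queue_entry_t;

typedef struct {
    queue_entry_t entries[L2_QUEUE_DEPTH];
    int           valid_count;    /* used by the arbiter to derive the weight */
} request_queue_t;

typedef struct {
    request_queue_t crq;          /* core read queue        */
    request_queue_t cwq;          /* core write queue       */
    request_queue_t epq;          /* extended probe queue   */
    request_queue_t orq;          /* outgoing request queue */
    request_queue_t wrq;          /* write request queue    */
    request_queue_t efq;          /* extended fill queue    */
    request_queue_t cpq;          /* core probe queue       */
    /* TAG RAM (addr + MESI state per cacheline, for L2 and every core) and
       DATA RAM (the cacheline data itself) are omitted here for brevity. */
} l2_cache_t;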
In conclusion, the invention enables requests that need the fastest response from the core or the lower-level memory to obtain that response, thereby effectively improving the overall performance of the CPU.
The invention can avoid the situation in which the queue of one core is full while the queues of other cores are empty because that core has received no responses to too many outstanding requests, thereby keeping the queues balanced, preventing the frequent scenario in which a core cannot continue to send requests to L2, and effectively improving the overall performance of the CPU.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for implementing an arbitration mechanism in L2, the method comprising the following steps:
S1, receiving arbitration requests from the queues carrying weight information, namely crq, cwq, esq, efq and orq;
S2, selecting one of the arbitration requests from S1 to enter the pipeline according to the weight information of the requests;
S3, judging the weight information of the requests; if two or more requests are tied at the highest weight, arbitrating according to the order efq > esq > orq > crq/cwq;
S4, if arbitration requests of equal weight remain tied, resolving their priority by round-robin polling.
2. The method for implementing the arbitration mechanism in L2 according to claim 1, wherein in the method, when crq and cwq have the same weight, their priority is determined by round-robin polling: if crq obtained arbitration last time, the priority of cwq is higher than that of crq next time; if cwq obtained arbitration last time, the priority of crq is higher than that of cwq next time.
3. The method according to claim 1, wherein in the method, when weights are assigned, if the base level of efq is 2 or 3, the weights from 2 to 8 or from 3 to 8 are apportioned according to the number of valid efq entries.
4. The method according to claim 3, wherein in the method, when weights are assigned, if the base level of orq is 1, the weights from 1 to 8 are apportioned according to the number of valid orq entries.
5. The method according to claim 3, wherein in the method, when weights are assigned, if the base level of crq/cwq is 0, the weights from 0 to 8 are apportioned according to the number of valid crq/cwq entries.
6. The method for implementing the arbitration mechanism in L2 according to claim 1, wherein the operation of L2 further comprises the following steps:
T1, receiving requests from the cores and the external extension, receiving requests from the CRQ, CWQ, EPQ, ORQ, EFQ and CPQ, and selecting one of them to enter the pipeline;
T2, generating the corresponding allocation scheme according to the type of the request and the TAG information and MESI state information currently held in L2;
T3, issuing a read request or a write request to the downstream memory through L2, the downstream memory returning the loaded data to the EFQ;
T4, writing the loaded data from the EFQ into L2$ via the L2 pipeline while returning the data to the requesting core;
wherein L2 receives requests from the cores and the external extension as follows: a read request from a core is placed in the CRQ; a write request from a core is placed in the CWQ; a probe request from the outside is placed in the EPQ.
7. The method for implementing the arbitration mechanism in L2 according to claim 6, wherein in the pipeline, according to the type of the request and the TAG information and MESI state information currently held in L2, it is determined: whether L2 can be written directly; whether the return data can be sent directly to the requester; whether data or permission must be reloaded from the downstream memory; whether an evict needs to be generated; whether data must be written to the downstream memory; and whether a core needs to be probed; if data or permission must be loaded from the downstream memory, an ORQ entry is allocated; if data must be written to the downstream memory, a WRQ entry is allocated; if a core needs to be probed, a CPQ entry is allocated, and the probe requests that L2 sends to the cores are all completed through the CPQ.
8. The method for implementing the arbitration mechanism in L2 according to claim 6, wherein in the method, when L2 sends a read request to the downstream memory, the read request is issued through the ORQ, which reads the data from the downstream memory and takes the permission corresponding to the data; and when L2 sends a write request to the downstream memory, the data is written from L2 to the next-level memory through the WRQ.
9. A system for implementing the arbitration mechanism in L2, the system being configured to implement the method for implementing the arbitration mechanism in L2 as claimed in any one of claims 1-8, comprising a probe module, an evict module, a TAG RAM, a DATA RAM, a CRQ, a CWQ, an EPQ, an ORQ, a WRQ, an EFQ and a CPQ.
10. The system according to claim 9, wherein the probe module is used for snooping: it probes dirty data out of a core or modifies the MESI state information in a core in order to obtain the E permission;
the evict module is used for keeping the data stored in the cache relatively new, an evict being generated when data needs to be replaced in the cache;
the TAG RAM is used for recording the addr of each cacheline and the MESI state information of each cacheline in L2 and in all the cores under L2;
the DATA RAM is used for recording the cacheline data;
the CRQ is a queue for receiving and storing read requests from the cores;
the CWQ is a queue for receiving write requests from the cores;
the EPQ is a queue for receiving probe requests from the outside;
the ORQ is applied for when a request going through the L2 pipeline finds that the cacheline does not exist in the cache or that the access permission of the cacheline in the cache is insufficient, data being loaded from the next-level memory through the ORQ and the corresponding permission taken;
the WRQ is applied for when L2 needs to write a cacheline to the next-level memory, the data being written into the next-level memory through the WRQ;
the EFQ is used for backfilling reload data: when reload data is returned to L2, it is first backfilled into the EFQ and then written into L2$ through the L2 pipeline while the data is returned to the requesting module;
the CPQ is used for storing probe requests to the corresponding cores: a probe request is first stored in the CPQ, and the CPQ then sends the probe request to the corresponding core.
CN202111549385.3A 2021-12-17 2021-12-17 Method and system for implementing arbitration mechanism in L2 Pending CN114218140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111549385.3A CN114218140A (en) 2021-12-17 2021-12-17 Method and system for implementing arbitration mechanism in L2

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111549385.3A CN114218140A (en) 2021-12-17 2021-12-17 Method and system for implementing arbitration mechanism in L2

Publications (1)

Publication Number Publication Date
CN114218140A true CN114218140A (en) 2022-03-22

Family

ID=80703428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111549385.3A Pending CN114218140A (en) 2021-12-17 2021-12-17 Method and system for implementing arbitration mechanism in L2

Country Status (1)

Country Link
CN (1) CN114218140A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination