CN116795767A - Multi-core Cache sharing consistency protocol construction method based on the CHI protocol


Info

Publication number: CN116795767A
Authority: CN (China)
Prior art keywords: cache, data, line, tag, core
Legal status: Pending
Application number: CN202310033047.7A
Filing and priority date: 2023-01-10
Other languages: Chinese (zh)
Inventors: 郭兵, 王洋
Current assignee: Harbin University of Science and Technology
Original assignee: Harbin University of Science and Technology

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 - Interprocessor communication
    • G06F 15/167 - Interprocessor communication using a common memory, e.g. mailbox


Abstract

A multi-core Cache sharing consistency protocol construction method based on the CHI protocol, belonging to the field of multi-core shared-Cache consistency protocols. As data interaction between the cores of a multi-core processor becomes more frequent and shared Cache information grows, the state-maintenance efficiency of the Cache consistency protocol must improve; the method addresses this problem. The method comprises the following steps: designing the topology of the overall system, which includes RN0 and RN1; designing the data path of the HN, which comprises a REQ path, an RSP path, an SNP path and a DAT path; and designing the composition of the Cache in the HN. The Cache design in the HN centers on its L3 Cache, which comprises a tag_SRAM and a data_SRAM, where the tag_SRAM contains a Tag field and a Status field. The invention improves the speed at which Dirty data is written back to main memory. The state of each Cache line is tracked and updated, according to the read and write operations of the CPU core and the corresponding transactions on the bus, to maintain Cache consistency.

Description

Multi-core Cache sharing consistency protocol construction method based on CHI protocol
Technical Field
The invention relates to a design method for a Cache sharing consistency protocol, in particular to a multi-core Cache sharing consistency protocol construction method based on the CHI protocol.
Background
To alleviate the widening "scissors gap" between processor and memory performance, cache memories (Caches) were introduced into the processor architecture. Because CPU clock frequencies have grown rapidly, the core runs far faster than main memory, and this performance gap keeps widening as fabrication technology advances. With the rapid development of semiconductor process technology and the integrated-circuit industry, single-core processor performance rose quickly, but its power consumption also rose nearly linearly, and this power wall severely constrained further gains in overall processor performance. As single-core performance approached its limits, architectural breakthroughs followed, for example introducing parallel processing to improve the overall performance of the processor. As the number of cores integrated in a multi-core processor and the number of levels in the storage hierarchy increase, the design complexity of the Cache consistency protocol grows sharply, the "coherence wall" problem becomes increasingly severe, and verifying the consistency protocol in a multi-core processor poses new challenges.
In a multi-core processor, a program is divided into several parts that are processed by multiple cores simultaneously, so copies of the data at the same address may exist in the Caches of several cores; to keep these copies consistent, a Cache consistency protocol must be adopted. With the development of multi-core technology, Cache consistency has become the biggest bottleneck restricting multi-core processor performance. In a multi-core Cache consistency protocol, for any one Cache line, multiple cores may read it at any moment but only one core may write it.
In 2013, ARM published the fifth-generation AMBA protocol, the CHI (Coherent Hub Interconnect) protocol, as a redesign of the AXI/ACE protocol. The signal-based AXI/ACE protocol was replaced by a new packet-based, layered CHI protocol, whose advantage over the previous generations is its suitability for multi-core systems. Traditional Cache consistency protocols fall into two classes: bus-snooping protocols and directory-based protocols. A bus-snooping protocol broadcasts Cache line state information to the other cores over the bus; it is simple to design and occupies little area, but its latency and traffic are large, so it does not suit a large-scale multi-core processor architecture. A directory-based protocol stores Cache line state in a directory structure and sends state-change information point-to-point through it; it is fast and uses little bandwidth, but it needs more chip area, and the directory's footprint grows sharply with the number of cores. In the multi-core era, the cost of maintaining data consistency with either protocol alone is too high to be efficient. Therefore, in an era of more complex application scenarios and tighter power budgets, the industry needs a Cache consistency protocol with an optimized structure and high performance. Moreover, as multi-core processors develop, Cache consistency protocols face several challenges: the number of cores grows, data interaction between cores becomes more frequent, and shared Cache information increases, raising the following problems: how to improve the state-maintenance efficiency of the Cache consistency protocol, how to balance Cache storage overhead against the power consumed by consistency interactions, how to improve the scalability of the protocol, and how to design a consistency protocol model. The present invention therefore studies and designs a multi-core Cache consistency protocol.
Disclosure of Invention
The invention aims to solve the following problems, which arise as data interaction between the cores of a multi-core processor becomes more frequent: the state-maintenance efficiency of the Cache consistency protocol needs improving, Cache storage overhead must be balanced against the power consumed by consistency interactions, the scalability of the protocol needs improving, and a consistency protocol model needs designing. To this end, the invention provides a multi-core Cache sharing consistency protocol construction method based on the CHI protocol.
The above object is achieved by the following technical scheme:
a multi-core Cache sharing consistency protocol construction method based on CHI protocol comprises the following steps:
first, design the overall topology of the system:
the method comprises the steps that two cores are connected to XP, each XP is respectively connected with HN and SN, for Cache, L1Cache and L2Cache are placed in the RN for management, each RN is a core, and L3Cache is placed in the HN for management; SN stores and hosts, XP is responsible for data forwarding and flow control;
second, design HN's data path:
the data path of HN includes REQ path, RSP path, SNP and DAT path;
thirdly, designing a composition structure of the Cache in the HN;
the method comprises the steps that the Cache in the HN is mainly designed to be an L3Cache in the Cache, the Cache comprises a tag_SRAM and a data_SRAM, the tag_SRAM and the data_SRAM are in one-to-one correspondence, wherein the tag_SRAM comprises a Tag bit and a Status bit, the Tag bit stores a physical address, and the Status stores the state of the current Cache; the data_SRAM stores Data in the Cache, and each cache_line corresponds to the Data with the bit width of 64 Bytes.
Further, the process of designing the data path of the HN specifically comprises:

firstly, the RN generates a RequestFlit and transmits it on the REQ path; the message is 131 bits long and carries the request information; on entering the data path it first enters the Set MSHR, where any RequestFlit accessing a cache_line that is already being handled is blocked;

then, the RequestFlit contains the Address issued by the CPU, through which the corresponding cache_line address is found; a D module denotes a flip-flop implementing one pipeline beat of delay; after the corresponding cache_line address is found, the Flit enters the TxnID MSHR, which the Flits of the RSP and DAT paths also enter; the ResponseFlit is 66 bits long and the DataFlit is 406 bits long;

the DECODE module then parses the Flit of each channel; based on the parsed Flit, the TxnID MSHR decides how to change the state and Data of the cache_line, writes the Data destined for the L3 Cache into the WB_BUFFER, and notifies the data_SRAM of the line number of the cache_line to be replaced;

finally, the Data in the WB_BUFFER is written into the data_SRAM.
Further, in designing the composition of the Cache in the HN, the specific design content of the L3 Cache is as follows:

the whole L3 Cache is 1MB and each way is 16KB; PA[5:0] addresses each Byte within a cache_line, the remaining PA[15:6] locates the Set holding the cache_line, and PA[7:6] selects the Bank; the Address used by the CPU is [43:0] wide, so the Tag is [43:16];

a Directory structure is adopted; the Directory comprises the Tag_SRAMs of L1 and L2; because L1 and L2 follow an Inclusive policy, storing the L2 Tags in the Directory suffices; L2 is 512KB with an 8-way set-associative structure, and each cache_line corresponds to Data 64 Bytes wide;

the HN comprises the L3 Cache and the Directory; the received Address Flit is parsed in the RX_REQ channel and, after arbitration, enters the corresponding Bank of the L3 Cache, where the parsed Tag, Index and Byte fields together determine the address; after the 16 Cache lines of the Set are selected by the Index bits of the address, the Tag of each Cache line is read out and compared with the Tag in the address: if equal, the L3 Cache hits; if not, the current Cache lines hold data of other addresses and the access misses in the L3 Cache; at the same time, the Tags of the 8 cache_lines of the Directory Set are read out and compared likewise: if equal, the access hits in L2, otherwise it misses in L2; the lowest 2 bits of each Tag entry in the L3 Cache record the Status of the cache_line, and the lowest 2 bits of each Tag entry in the Directory store the states of the L1 and L2 Caches.
Further, the design of the L3 Cache also covers three parts: the first part is the composition of the multi-core Cache, the second part is the definition of the Cache line states and their transitions, and the third part is the CHI-based Cache consistency protocol; specifically:

(1) First part, composition of the multi-core Cache: in a single-core CPU, L1 is split into an instruction Cache L1I and a data Cache L1D, while L2 holds instructions and data together; a multi-core CPU is structured similarly but adds a third-level L3 Cache shared by all CPUs; thus in the multi-core structure, L1 and L2 are private and L3 is shared by all CPU cores;

the L1 Cache is private to a processor core, tightly coupled with it, and runs at the same frequency as the CPU; each core also has its own L2 Cache; the second-level cache buffers the first-level cache: L1 capacity is limited, so L2 holds the data the CPU needs that does not fit in L1; the L3 Cache is the largest of the three levels;

when the CPU runs, it first looks for the required data in L1, then in L2 if L1 misses, then in L3 if L2 misses; if none of the three cache levels holds the data, it is fetched from memory;

(2) Second part, definition of the Cache line states: seven states are used to describe a Cache line: I, UC, UD, SC, SD, UCE and UDP; I (Invalid) indicates that the current Cache line contains no valid data and is invalid; SC (Shared Clean) indicates that the Caches of other CPU cores may also contain this Cache line's data, and every shared copy holds the latest value, i.e. is in sync with main memory; UC (Unique Clean) indicates that only the current Cache line holds the data and no copy exists in any other CPU core; SD (Shared Dirty) indicates that the Caches of other CPU cores may contain the data, the data has been modified relative to main memory, and this Cache must ensure that main memory is eventually updated; UD (Unique Dirty) indicates that the data exists only in this Cache line, has been modified relative to memory, and must be written to the next-level Cache or memory;

(3) Third part, the CHI-based Cache consistency protocol: the CHI protocol supports a consistency granularity of 64 Bytes, Snoop Filter and Directory structures, and the MESI cache state machine model extended with the Partial and Empty cache states; by function it divides into a protocol layer, a network layer and a link layer; the protocol layer is the top layer of the CHI architecture.
The beneficial effects of the invention are as follows:
1. The invention realizes multi-core Cache sharing consistency. When the CPU sends a read transaction to the Cache and a ReadHit occurs, the hit Cache line is in a Valid state and the ReadHit does not change its state. When a ReadMiss occurs, the Cache line is Invalid, i.e. the data is not in the local Cache; the CPU then reads the data from main memory, and after the read the data is filled into the Cache line, whose state migrates to Valid, indicating that the Cache line and main memory are now consistent. When the CPU sends a Writeback write transaction to the Cache, the write moves the Cache line to the Dirty state; if the data must be propagated to main memory, then after it is written back, the Cache line state becomes Clean.
Specifically, the invention designs three components of the CHI-based multi-core Cache sharing consistency scheme: the topology of the overall system, the data path of the HN, and the composition of the Cache in the HN. Each component is designed in detail, which improves the speed at which Dirty data is written back to main memory. The state of each Cache line can be tracked and updated according to the read and write operations of the CPU core and the corresponding transactions on the bus, thereby maintaining Cache consistency.
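As an illustration of the state machine described in this effect, the following C++ sketch models those read and writeback transitions. It is a minimal sketch using the simplified Valid/Invalid/Dirty/Clean vocabulary above; all identifiers are hypothetical rather than taken from the actual design.

#include <iostream>

// Simplified per-line state, in the vocabulary of the paragraph above.
enum class State { Invalid, Clean, Dirty };  // Valid = Clean or Dirty

struct CacheLine { State state = State::Invalid; };

// Read: a hit on a Valid line leaves the state alone; a miss fetches
// from main memory, after which the line becomes Valid (Clean).
void read(CacheLine& line) {
    if (line.state == State::Invalid) {
        // ReadMiss: fetch from main memory, then the line is Valid
        line.state = State::Clean;
    }
    // ReadHit: no state change
}

// Writeback transaction from the CPU: the line becomes Dirty; once the
// data has actually been written back to main memory it returns to Clean.
void write(CacheLine& line) { line.state = State::Dirty; }
void flushToMemory(CacheLine& line) {
    if (line.state == State::Dirty) line.state = State::Clean;
}

int main() {
    CacheLine line;
    read(line);           // miss -> Clean
    write(line);          // -> Dirty
    flushToMemory(line);  // write data to main memory -> Clean
    std::cout << int(line.state) << "\n";  // prints 1 == Clean
}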
2. Cache consistency protocols generally divide into two major types, bus snooping (Snoop) and directory (Directory) protocols, and the invention combines the two as a technical improvement: the Cache state of some RNs is recorded in the HN through the Directory; on a Directory hit the corresponding RN is found directly, while otherwise the HN sends a Snoop operation to the other RNs to search their Caches. When the number of cores is large, using the Directory alone or Snoop alone wastes resources.
3. In addition, because writing Data back to main memory is slow, the invention designs a WB_BUFFER: the Data from the DAT channel is first written into the WB_BUFFER, which takes over responsibility for the write-back, so the next transaction can proceed without waiting for the current one to finish, improving transaction throughput.
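A minimal sketch of this decoupling idea, assuming a simple FIFO drained in the background; the names are hypothetical, and in the real design the WB_BUFFER sits between the DAT channel and the data_SRAM/main memory:

#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// Hypothetical write-back buffer: the transaction deposits its dirty
// data here and completes at once; the slow write to main memory (or
// the data_SRAM) is drained later, off the critical path.
struct WbEntry { uint64_t addr; std::vector<uint8_t> data; };

class WbBuffer {
    std::deque<WbEntry> pending;
public:
    void deposit(WbEntry e) { pending.push_back(std::move(e)); }  // fast
    // Called whenever the memory/SRAM port is free; writes one entry.
    bool drainOne() {
        if (pending.empty()) return false;
        WbEntry e = std::move(pending.front());
        pending.pop_front();
        std::cout << "writing back line at 0x" << std::hex << e.addr << "\n";
        return true;
    }
};

int main() {
    WbBuffer wb;
    wb.deposit({0x1000, std::vector<uint8_t>(64, 0xAB)});  // one 64-byte line
    // ... the transaction that produced this data has already completed ...
    while (wb.drainOne()) {}  // background drain
}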
Drawings
FIG. 1 is a topological structure diagram of the overall system according to the present invention;
FIG. 2 is a data path diagram of the HN according to the present invention;
FIG. 3 is a diagram of the composition and structure of a multi-level Cache according to the present invention;
FIG. 4 is a state type diagram of a Cache line according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without inventive effort fall within the scope of the invention.
Preferred embodiments of the invention:
Referring to FIGS. 1-4, the present invention provides a multi-core Cache sharing consistency protocol construction method based on the CHI protocol, which comprises the following steps:
first, design the overall topology of the system:
the structure is shown in FIG. 1 and comprises RN0 (Request Node) and RN1; the two cores connect to XPs (crosspoints), and each XP connects to two further nodes, the HN (Home Node) and the SN (Subordinate Node); for the Caches, the L1 Cache and L2 Cache are placed in the RN for management (each RN is one core), while the L3 Cache is placed in the HN for management; the HN is what chiefly realizes multi-core Cache consistency; the SN hosts main memory, and the XPs are responsible for data forwarding and flow control.
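For concreteness, the topology of this first step can be written down as a small node-and-link list; the C++ sketch below uses the node names of FIG. 1, while the representation itself is purely illustrative:

#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of the FIG. 1 topology: RN = Request Node,
// HN = Home Node (owns the L3 Cache and coherence), SN = Subordinate
// Node (hosts main memory), XP = crosspoint (forwarding, flow control).
enum class NodeType { RN, HN, SN, XP };

int main() {
    std::vector<std::pair<std::string, NodeType>> nodes = {
        {"RN0", NodeType::RN}, {"RN1", NodeType::RN},
        {"XP0", NodeType::XP}, {"XP1", NodeType::XP},
        {"HN", NodeType::HN},  {"SN", NodeType::SN},
    };
    std::vector<std::pair<std::string, std::string>> links = {
        {"RN0", "XP0"}, {"RN1", "XP1"},  // each core's RN hangs off an XP
        {"XP0", "HN"}, {"XP0", "SN"},    // every XP reaches both HN and SN
        {"XP1", "HN"}, {"XP1", "SN"},
    };
    std::cout << nodes.size() << " nodes\n";
    for (const auto& l : links)
        std::cout << l.first << " <-> " << l.second << "\n";
}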
second, design the data path of the HN:
to implement Cache sharing consistency, the study focuses mainly on the L3 Cache; the structure of the data path of the HN is shown in FIG. 2 and comprises a REQ path (Request channel), an RSP path (Response channel), an SNP path (Snoop channel) and a DAT path (Data channel);
third, design the composition of the Cache in the HN:
designing the Cache in the HN mainly means designing its L3 Cache, which comprises a tag_SRAM and a data_SRAM in one-to-one correspondence; the tag_SRAM contains a Tag field and a Status field, where the Tag field stores the physical address and the Status field stores the current state of the Cache line; the data_SRAM stores the Data of the Cache, and each cache_line corresponds to Data 64 Bytes wide; the address used by the CPU to access the Cache is laid out as follows:
TABLE 1 Address fields of a CPU Cache access

Field   Tag       Index    Bank    Byte
Width   28 bits   8 bits   2 bits  6 bits
Bits    [43:16]   [15:8]   [7:6]   [5:0]
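The field split of Table 1 is plain bit slicing of the 44-bit physical address; a C++ sketch of the decode (identifiers are illustrative, not from the design):

#include <cstdint>
#include <cstdio>

// Hypothetical decode of the 44-bit physical address per Table 1:
// Tag = PA[43:16], Index = PA[15:8], Bank = PA[7:6], Byte = PA[5:0].
struct CacheAddr {
    uint32_t tag;    // 28 bits
    uint32_t index;  // 8 bits -> 256 sets per bank
    uint32_t bank;   // 2 bits -> 4 banks
    uint32_t byte;   // 6 bits -> 64-byte cache_line
};

CacheAddr decode(uint64_t pa) {
    return CacheAddr{
        static_cast<uint32_t>((pa >> 16) & 0xFFFFFFF),
        static_cast<uint32_t>((pa >> 8) & 0xFF),
        static_cast<uint32_t>((pa >> 6) & 0x3),
        static_cast<uint32_t>(pa & 0x3F),
    };
}

int main() {
    CacheAddr a = decode(0x0123456789AULL);  // any 44-bit address
    std::printf("tag=%07x index=%02x bank=%x byte=%02x\n",
                a.tag, a.index, a.bank, a.byte);
}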
The process of designing the data path of the HN specifically comprises the following steps:
firstly, the RN generates a RequestFlit and transmits it on the REQ path; the message is 131 bits long and carries the request information; on entering the data path it first enters the Set MSHR (Miss Status Handling Register), where any RequestFlit accessing a cache_line that is already being handled is blocked;

then, the RequestFlit contains the Address issued by the CPU, through which the corresponding cache_line address is found; a D module denotes a flip-flop implementing one pipeline beat of delay; after the corresponding cache_line address is found, the Flit enters the TxnID MSHR (Transaction Miss Status Handling Register), which the Flits of the RSP and DAT paths also enter; the ResponseFlit is 66 bits long and the DataFlit is 406 bits long;

the DECODE module then parses the Flit of each channel; based on the parsed Flit, the TxnID MSHR decides how to change the state and Data of the cache_line, writes the Data destined for the L3 Cache into the WB_BUFFER (WriteBack_BUFFER), and notifies the data_SRAM of the line number of the cache_line to be replaced;

finally, the Data in the WB_BUFFER is written into the data_SRAM.
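To make the blocking behavior of the first MSHR stage concrete, here is a C++ sketch, with hypothetical names and flits reduced to the fields mentioned above, that admits a RequestFlit only while no outstanding request targets the same cache_line:

#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_set>

// Hypothetical flit carrying only the fields discussed in the text.
// The real flits here are 131 (REQ), 66 (RSP) and 406 (DAT) bits wide.
struct RequestFlit {
    uint64_t cacheLineAddr;  // address of the targeted cache_line
    uint32_t txnId;          // transaction ID, later indexing the TxnID MSHR
};

// Sketch of the Set MSHR: blocks a new request that targets a line
// already being handled, so two transactions never race on one line.
class SetMshr {
    std::unordered_set<uint64_t> busyLines;
public:
    // Returns the flit if admitted, or nothing if it must stall.
    std::optional<RequestFlit> tryAdmit(const RequestFlit& f) {
        if (busyLines.count(f.cacheLineAddr)) return std::nullopt;  // blocked
        busyLines.insert(f.cacheLineAddr);
        return f;
    }
    void release(uint64_t line) { busyLines.erase(line); }  // on completion
};

int main() {
    SetMshr mshr;
    RequestFlit a{0x1000, 1}, b{0x1000, 2};
    std::cout << "a admitted: " << bool(mshr.tryAdmit(a)) << "\n";  // 1
    std::cout << "b admitted: " << bool(mshr.tryAdmit(b)) << "\n";  // 0, same line
    mshr.release(0x1000);
    std::cout << "b retried:  " << bool(mshr.tryAdmit(b)) << "\n";  // 1
}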
In designing the composition of the Cache in the HN, the specific design content of the L3 Cache is as follows:

the whole L3 Cache is 1MB; because the Cache is divided into 4 Banks, each Bank is 256KB; a 16-way set-associative structure is adopted, so each way is 16KB and PA[15:0] suffices to address within one way; because each data block is 64 Bytes, PA[5:0] addresses each Byte within a cache_line, and the remaining PA[15:6] locates the Set holding the cache_line, of which PA[7:6] selects the Bank; the Address used by the CPU is [43:0] wide, so the Tag is [43:16].
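The stated geometry is self-consistent, as the following compile-time C++ check illustrates (constants taken from the text, identifiers hypothetical):

#include <cstdint>

// Compile-time check of the L3 geometry described above.
constexpr uint64_t kL3Size   = 1ull << 20;  // 1MB total
constexpr uint64_t kBanks    = 4;
constexpr uint64_t kWays     = 16;
constexpr uint64_t kLineSize = 64;          // bytes per cache_line

constexpr uint64_t kBankSize = kL3Size / kBanks;      // per-Bank capacity
constexpr uint64_t kWaySize  = kBankSize / kWays;     // per-way capacity
constexpr uint64_t kSets     = kWaySize / kLineSize;  // sets per Bank

static_assert(kBankSize == 256 * 1024, "each Bank is 256KB");
static_assert(kWaySize == 16 * 1024, "each way is 16KB");
static_assert(kSets == 256, "256 sets -> the 8 Index bits PA[15:8]");

int main() { return 0; }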
The invention adopts a Directory structure; the Directory comprises the Tag_SRAMs of L1 and L2; because L1 and L2 follow an Inclusive policy, storing the L2 Tags in the Directory suffices; L2 is 512KB with an 8-way set-associative structure, and each cache_line corresponds to Data 64 Bytes wide;

the HN comprises the L3 Cache and the Directory; the received Address Flit is parsed in the RX_REQ channel and, after arbitration, enters the corresponding Bank of the L3 Cache, where the parsed Tag, Index and Byte fields together determine the address; thus, when the 16 Cache lines of a Set are selected by the Index bits of the address, the Tag of each Cache line is read out and compared with the Tag in the address: if equal, the L3 Cache hits; if not, the current Cache lines hold data of other addresses and the access misses in the L3 Cache. The Tags stored in the Directory are organized 8-way set-associative, so in parallel the Tags of the 8 cache_lines of the Directory Set are read out and compared likewise: if equal, the access hits in L2; if unequal, it misses in L2; the lowest 2 bits of each Tag entry in the L3 Cache record the Status of the cache_line, and the lowest 2 bits of each Tag entry in the Directory store the states of the L1 and L2 Caches.
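A sketch of the parallel Tag comparison just described, searching a 16-way L3 Set and an 8-way Directory Set; names and values are illustrative only:

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Hypothetical tag entry: 28-bit Tag plus the 2 low status bits the
// text places at the bottom of each tag_SRAM word.
struct TagEntry {
    uint32_t tag;     // PA[43:16]
    uint8_t  status;  // 2-bit cache_line state
    bool     valid;
};

// Search one L3 Set (16-way) or one Directory Set (8-way) for a Tag.
template <std::size_t WAYS>
int findWay(const std::array<TagEntry, WAYS>& set, uint32_t tag) {
    for (std::size_t w = 0; w < WAYS; ++w)
        if (set[w].valid && set[w].tag == tag) return static_cast<int>(w);
    return -1;  // miss: the set holds data of other addresses
}

int main() {
    std::array<TagEntry, 16> l3Set{};   // one Set of the 16-way L3
    std::array<TagEntry, 8>  dirSet{};  // one Set of the 8-way Directory
    l3Set[3] = {0x0ABCDEF, /*status=*/2, true};
    std::cout << "L3 way:  " << findWay(l3Set, 0x0ABCDEF) << "\n";  // 3 -> hit
    std::cout << "Dir way: " << findWay(dirSet, 0x0ABCDEF) << "\n"; // -1 -> miss
}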
The design of the L3 Cache further covers three parts: the first part is the composition of the multi-core Cache, the second part is the definition of the Cache line states and their transitions, and the third part is the CHI-based Cache consistency protocol; specifically:
(1) First part, composition of the multi-core Cache: in a single-core CPU, to relieve resource conflicts in the CPU instruction pipeline, L1 is split into an instruction Cache L1I and a data Cache L1D, while L2 holds instructions and data together; a multi-core CPU is structured similarly but adds a third-level L3 Cache shared by all CPUs; thus in the multi-core structure, L1 and L2 are private and L3 is shared by all CPU cores; the composition is shown in FIG. 3.

The L1 Cache is private to a processor core, tightly coupled with it, and runs at the same frequency as the CPU; the L2 Cache is usually larger and slower than L1, and each core has its own; the second-level cache buffers the first-level cache: because L1 is expensive to manufacture its capacity is limited, so L2 holds the data the CPU needs that does not fit in L1; the L3 Cache is the largest of the three levels, for example 12MB, and also the slowest, and the CPU cores of the same processor share one L3 Cache.

When the CPU runs, it first looks for the required data in L1, then in L2 if L1 misses, then in L3 if L2 misses; if none of the three cache levels holds the data, it is fetched from memory; the longer the search path, the longer it takes, so data that is fetched very frequently should be kept in the L1 cache, where access is fastest.
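A C++ sketch of this search order under deliberately simplified assumptions (each level reduced to a map, hypothetical names), filling L1 on a full miss so frequently used data stays fast to reach:

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Hypothetical three-level lookup mirroring the search order above:
// try L1, then L2, then L3, and fall back to main memory on a full miss.
using Level = std::unordered_map<uint64_t, uint32_t>;  // addr -> data

uint32_t load(uint64_t addr, std::vector<Level*>& levels, Level& memory) {
    for (std::size_t i = 0; i < levels.size(); ++i) {
        auto it = levels[i]->find(addr);
        if (it != levels[i]->end()) {
            std::cout << "hit in L" << (i + 1) << "\n";
            return it->second;
        }
    }
    std::cout << "miss in all caches, reading main memory\n";
    uint32_t data = memory[addr];
    (*levels[0])[addr] = data;  // fill L1 so hot data stays close
    return data;
}

int main() {
    Level l1, l2, l3, memory;
    memory[0x40] = 42;
    std::vector<Level*> levels = {&l1, &l2, &l3};
    load(0x40, levels, memory);  // misses everywhere, fills from memory
    load(0x40, levels, memory);  // now hits in L1
}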
(2) Second part, definition of the Cache line states: the invention uses seven states to describe a Cache line, I, UC, UD, SC, SD, UCE and UDP, as shown in FIG. 4. I (Invalid): the current Cache line contains no valid data and is invalid. SC (Shared Clean): the Caches of other CPU cores may also contain this Cache line's data; every shared copy holds the latest value, i.e. is in sync with main memory. UC (Unique Clean): only the current Cache line holds the data; no copy exists in any other CPU core. SD (Shared Dirty): the Caches of other CPU cores may contain the data; the data has been modified relative to main memory, and this Cache must ensure that main memory is eventually updated. UD (Unique Dirty): the data exists only in this Cache line, has been modified relative to memory, and must be written to the next-level Cache or memory.
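The seven states can be listed as a C++ enum, shown below; the expansions of UCE and UDP (Unique Clean Empty, Unique Dirty Partial) follow the CHI specification, and the eviction helper is an illustrative assumption, not part of the patented method:

#include <iostream>

// The seven CHI cache_line states used in the text; comments paraphrase
// the definitions above, with UCE/UDP expanded per the CHI spec.
enum class LineState {
    I,    // Invalid: no valid data in this cache_line
    UC,   // Unique Clean: only copy, matches main memory
    UCE,  // Unique Clean Empty: line owned uniquely but holds no data yet
    UD,   // Unique Dirty: only copy, modified, must be written back
    UDP,  // Unique Dirty Partial: modified, only part of the line is valid
    SC,   // Shared Clean: copies may exist elsewhere, all in sync with memory
    SD,   // Shared Dirty: copies may exist; this cache must update memory
};

// A line may be dropped silently only if it holds no dirty data.
bool canSilentlyEvict(LineState s) {
    return s == LineState::I || s == LineState::UC ||
           s == LineState::UCE || s == LineState::SC;
}

int main() {
    std::cout << canSilentlyEvict(LineState::UD) << "\n";  // 0: must write back
    std::cout << canSilentlyEvict(LineState::SC) << "\n";  // 1
}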
(3) Third part, the CHI-based Cache consistency protocol: the CHI protocol supports a consistency granularity of 64 Bytes, Snoop Filter and Directory structures, and the MESI cache state machine model extended with the Partial and Empty cache states; by function it divides into a protocol layer, a network layer and a link layer. The protocol layer is the top layer of the CHI architecture and the most important for the study of Cache consistency; its functions are: 1. generating and processing requests and responses at the protocol nodes; 2. defining the cache states and state transitions allowed at each protocol node; 3. defining the transfer flow of each request type; 4. managing protocol-layer flow control.
The embodiments above are disclosed as preferred embodiments of the invention but do not limit it; those skilled in the art will readily appreciate from the foregoing that various extensions and modifications can be made without departing from the spirit of the invention.

Claims (4)

1. A multi-core Cache sharing consistency protocol construction method based on the CHI protocol, characterized in that the construction method comprises the following steps:

first, design the overall topology of the system:

the two cores connect to XPs, and each XP connects to the HN and the SN; for the Caches, the L1 Cache and L2 Cache are placed in the RN for management (each RN is one core), while the L3 Cache is placed in the HN for management; the SN hosts main memory, and the XPs are responsible for data forwarding and flow control;

second, design the data path of the HN:

the data path of the HN comprises a REQ path, an RSP path, an SNP path and a DAT path;

third, design the composition of the Cache in the HN:

designing the Cache in the HN mainly means designing its L3 Cache, which comprises a tag_SRAM and a data_SRAM in one-to-one correspondence; the tag_SRAM contains a Tag field and a Status field, where the Tag field stores the physical address and the Status field stores the current state of the Cache line; the data_SRAM stores the Data of the Cache, and each cache_line corresponds to Data 64 Bytes wide.
2. The multi-core Cache sharing consistency protocol construction method based on the CHI protocol according to claim 1, characterized in that the process of designing the data path of the HN specifically comprises:

firstly, the RN generates a RequestFlit and transmits it on the REQ path; the message is 131 bits long and carries the request information; on entering the data path it first enters the Set MSHR, where any RequestFlit accessing a cache_line that is already being handled is blocked;

then, the RequestFlit contains the Address issued by the CPU, through which the corresponding cache_line address is found; a D module denotes a flip-flop implementing one pipeline beat of delay; after the corresponding cache_line address is found, the Flit enters the TxnID MSHR, which the Flits of the RSP and DAT paths also enter; the ResponseFlit is 66 bits long and the DataFlit is 406 bits long;

the DECODE module then parses the Flit of each channel; based on the parsed Flit, the TxnID MSHR decides how to change the state and Data of the cache_line, writes the Data destined for the L3 Cache into the WB_BUFFER, and notifies the data_SRAM of the line number of the cache_line to be replaced;

finally, the Data in the WB_BUFFER is written into the data_SRAM.
3. The multi-core Cache sharing consistency protocol construction method based on the CHI protocol according to claim 2, characterized in that, in designing the composition of the Cache in the HN, the specific design content of the L3 Cache is as follows:

the whole L3 Cache is 1MB and each way is 16KB; PA[5:0] addresses each Byte within a cache_line, the remaining PA[15:6] locates the Set holding the cache_line, and PA[7:6] selects the Bank; the Address used by the CPU is [43:0] wide, so the Tag is [43:16];

a Directory structure is adopted; the Directory comprises the Tag_SRAMs of L1 and L2; because L1 and L2 follow an Inclusive policy, storing the L2 Tags in the Directory suffices; L2 is 512KB with an 8-way set-associative structure, and each cache_line corresponds to Data 64 Bytes wide;

the HN comprises the L3 Cache and the Directory; the received Address Flit is parsed in the RX_REQ channel and, after arbitration, enters the corresponding Bank of the L3 Cache, where the parsed Tag, Index and Byte fields together determine the address; after the 16 Cache lines of the Set are selected by the Index bits of the address, the Tag of each Cache line is read out and compared with the Tag in the address: if equal, the L3 Cache hits; if not, the current Cache lines hold data of other addresses and the access misses in the L3 Cache; at the same time, the Tags of the 8 cache_lines of the Directory Set are read out and compared likewise: if equal, the access hits in L2, otherwise it misses in L2; the lowest 2 bits of each Tag entry in the L3 Cache record the Status of the cache_line, and the lowest 2 bits of each Tag entry in the Directory store the states of the L1 and L2 Caches.
4. The multi-core Cache sharing consistency protocol construction method based on the CHI protocol according to claim 3, characterized in that the design of the L3 Cache also covers three parts: the first part is the composition of the multi-core Cache, the second part is the definition of the Cache line states and their transitions, and the third part is the CHI-based Cache consistency protocol; specifically:

(1) First part, composition of the multi-core Cache: in a single-core CPU, L1 is split into an instruction Cache L1I and a data Cache L1D, while L2 holds instructions and data together; a multi-core CPU is structured similarly but adds a third-level L3 Cache shared by all CPUs; thus in the multi-core structure, L1 and L2 are private and L3 is shared by all CPU cores;

the L1 Cache is private to a processor core, tightly coupled with it, and runs at the same frequency as the CPU; each core also has its own L2 Cache; the second-level cache buffers the first-level cache: L1 capacity is limited, so L2 holds the data the CPU needs that does not fit in L1; the L3 Cache is the largest of the three levels;

when the CPU runs, it first looks for the required data in L1, then in L2 if L1 misses, then in L3 if L2 misses; if none of the three cache levels holds the data, it is fetched from memory;

(2) Second part, definition of the Cache line states: seven states are used to describe a Cache line: I, UC, UD, SC, SD, UCE and UDP; I (Invalid) indicates that the current Cache line contains no valid data and is invalid; SC (Shared Clean) indicates that the Caches of other CPU cores may also contain this Cache line's data, and every shared copy holds the latest value, i.e. is in sync with main memory; UC (Unique Clean) indicates that only the current Cache line holds the data and no copy exists in any other CPU core; SD (Shared Dirty) indicates that the Caches of other CPU cores may contain the data, the data has been modified relative to main memory, and this Cache must ensure that main memory is eventually updated; UD (Unique Dirty) indicates that the data exists only in this Cache line, has been modified relative to memory, and must be written to the next-level Cache or memory;

(3) Third part, the CHI-based Cache consistency protocol: the CHI protocol supports a consistency granularity of 64 Bytes, Snoop Filter and Directory structures, and the MESI cache state machine model extended with the Partial and Empty cache states; by function it divides into a protocol layer, a network layer and a link layer; the protocol layer is the top layer of the CHI architecture.
CN202310033047.7A (priority date 2023-01-10, filing date 2023-01-10) - Multi-core Cache sharing consistency protocol construction method based on CHI protocol - Pending - published as CN116795767A

Priority Applications (1)

Application number: CN202310033047.7A
Priority and filing date: 2023-01-10
Title: Multi-core Cache sharing consistency protocol construction method based on CHI protocol

Publications (1)

Publication number: CN116795767A

Family ID: 88035257

Family Applications (1)

Application number: CN202310033047.7A (Pending; priority and filing date 2023-01-10)
Title: Multi-core Cache sharing consistency protocol construction method based on CHI protocol

Country Status (1): CN - CN116795767A

Cited By (1)

* Cited by examiner, † Cited by third party

CN117971718A * (北京微核芯科技有限公司; priority date 2024-03-28, publication date 2024-05-03) - Cache replacement method and device for multi-core processor


Legal Events

PB01 - Publication
SE01 - Entry into force of request for substantive examination