CN111143244B - Memory access method of computer equipment and computer equipment - Google Patents

Memory access method of computer equipment and computer equipment

Info

Publication number
CN111143244B
CN111143244B (application CN201911394845.2A)
Authority
CN
China
Prior art keywords
cache
node
rcls
cache line
nodes
Prior art date
Legal status
Active
Application number
CN201911394845.2A
Other languages
Chinese (zh)
Other versions
CN111143244A (en)
Inventor
蔡云龙
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN201911394845.2A priority Critical patent/CN111143244B/en
Publication of CN111143244A publication Critical patent/CN111143244A/en
Application granted granted Critical
Publication of CN111143244B publication Critical patent/CN111143244B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory. The method comprises: a node locally storing cache lines from the caches of other nodes to form a remote cache local image; and a processor core of the node accessing a cache line from the remote cache local image. With this method and device, a node in the computer device mirrors the cached data of other nodes to local storage, which reduces the number of cross-node memory accesses and improves computer performance in large-memory application scenarios.

Description

Memory access method of computer equipment and computer equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a memory access method for a computer device and a computer device.
Background
Existing servers based on the NUMA (non-uniform memory access) architecture can combine dozens or even hundreds of CPUs in one server and have good scalability. In a NUMA server, multiple nodes are connected together by an interconnection network; each node has its own CPUs, caches, memory, and I/O devices, and all memory is shared across the entire server. In this architecture, a CPU accesses local memory within its node faster and with lower latency than remote memory on other nodes. However, in some applications with large memory usage (such as database lookups), the CPU of each node inevitably needs to access remote memory frequently, which greatly reduces server performance.
Disclosure of Invention
In view of this, a memory access method for a computer device and a computer device are provided. The computer device has multiple nodes, and a node mirrors the cached data of other nodes to local storage, which reduces the number of cross-node memory accesses and improves computer performance in large-memory application scenarios.
According to a first aspect of the present disclosure, there is provided a memory access method for a computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising: a node of the plurality of nodes locally storing cache lines from the caches of other nodes to form a remote cache local image (RCLS); and a processor core of the node accessing a cache line from the RCLS.
In one possible embodiment, the node may include a plurality of processor cores, and the cache may be an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS is stored in memory or L4 cache of the node.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or may be updated when the cache line in the other node's cache is infrequently read and written.
In one possible embodiment, the cacheline may include write history information that records the number of times the cacheline was written in the L3 cache and may be cleared or modified to be consistent in the L3 cache and the RCLS when transferred to the RCLS.
In one possible embodiment, the method may further include: a processor core of the node broadcasting a cache line request message on the interconnect bus; a processor core of the node receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line; the processor core of the node compares the write history information of the cache line in the RCLS with the write history information of the response message; based on the comparison, the processor core of the node obtains the cache line from the RCLS or obtains the cache line from the other node.
In one possible embodiment, when the comparison indicates that the cacheline of the other node is the same as the cacheline of the RCLS, the cacheline may be obtained from the RCLS; when the comparison indicates that the cacheline of the other node is different from the cacheline of the RCLS, the cacheline may be obtained from the other node and the RCLS may be updated.
In one possible embodiment, the method may further include, after the processor core of the node acquires the cache line, modifying a flag bit of the cache line in the cache of the other node to be shared when the access is a read operation; when the access is a write operation, the flag bit of the cache line in the cache of the other node is modified to be invalid.
According to a second aspect of the present disclosure, there is provided a computer device comprising a plurality of nodes connected via an interconnect bus, each node comprising an integrated processor core and cache, and a memory, wherein a node of the plurality of nodes locally stores a remote cache local image (RCLS) that stores cache lines from the caches of other nodes and provides those cache lines to the processor cores of the node.
In one possible embodiment, the node may include a plurality of processor cores, and the cache is an L3 cache shared by the plurality of processor cores.
In one possible embodiment, the RCLS may be stored in memory or L4 cache of the node.
In one possible embodiment, the cache line of the RCLS may be updated periodically, or when the cache line in the other node's cache is infrequently read and written.
In one possible embodiment, the cacheline may include write history information that records the number of times the cacheline was written in the L3 cache and may be cleared or modified to be coherent in the L3 cache and the RCLS when transferred to the RCLS.
In one possible embodiment, the processor core of the node may be configured to: broadcasting a cache line request message on the interconnect bus; receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line; comparing write history information of a cache line in the RCLS with write history information of the response message; obtaining the cacheline from the RCLS or obtaining the cacheline from the other node according to the comparison.
In one possible embodiment, the processor core of the node may be further configured to obtain the cache line from the RCLS when the comparison indicates that the cache line of the other node is the same as the cache line of the RCLS; when the comparison indicates that the cacheline of the other node is different from the cacheline of the RCLS, the cacheline is obtained from the other node, and the RCLS is updated.
In one possible embodiment, the processor core of the node is further configured to modify the flag bit of the cache line in the cache of the other node to be shared when the access is a read operation after the cache line is acquired; when the access is a write operation, the flag bit of the cache line in the cache of the other node is modified to be invalid.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts, unless otherwise specified.
FIG. 1 shows a schematic diagram of a computer device of a NUMA architecture in accordance with an embodiment of the present disclosure.
Fig. 2 is a schematic flow chart diagram illustrating a memory access method for a NUMA architecture according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a computer device of another NUMA architecture in accordance with an embodiment of the present disclosure.
FIG. 4 shows a schematic flow diagram of another memory access method for a NUMA architecture according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "an" and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs, unless otherwise defined. Terms used herein should be interpreted as having a meaning consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense. For example, in the present disclosure, the term "cache" is used interchangeably with "cache memory", and the term "memory" has the same meaning as "main memory" or "system memory".
Fig. 1 shows a schematic diagram of a computer device 100 of a NUMA architecture in accordance with an embodiment of the present disclosure. Computer device 100 may be, for example, a server, a home computer, a high-performance computer (HPC), or the like. As shown, computer device 100 is based on a NUMA architecture and includes two NUMA nodes 101a and 101b, which may also be referred to as sockets. For ease of presentation and understanding, reference numerals ending in the letter "a" should be understood as components or parts associated with node 101a, and reference numerals ending in the letter "b" as components or parts associated with node 101b. Although fig. 1 shows only two nodes 101a and 101b, those skilled in the art will appreciate that the computer device 100 may include more nodes; the number is not limited thereto.
Nodes 101a and 101b in fig. 1 are similarly arranged, and the specific composition of the computer apparatus 100 is described below with reference to only node 101 a.
As shown, node 101a may include processor device 110a, memory 120a, I/O devices, etc. (not shown for brevity). The processor device 110a may be a single-core or multi-core processor chip, for example, as in the embodiment shown in fig. 1, the processor device 110a includes a plurality of processor cores 111a and 112a, etc., each of which includes a core and an L1 cache (L1 cache) and an L2 cache (L2 cache) that are specific to the core. Within the core may be a memory access component, a bus processing component, etc., implemented in hardware or software.
In the illustrated embodiment, the remainder of the processor device 110a includes various interconnection circuits and interfaces and components for connecting various functional blocks on the processor device in communication. As shown, within processor device 110a, multiple processor cores 111a and 112a may communicate with each other via an intra-socket interconnect 113 a. For simplicity, the interconnect is depicted as an intra-socket interconnect 113a, however, it should be understood that interconnect 113a may represent one or more interconnect structures, such as buses and single or multi-channel serial point-to-point, ring, or mesh interconnect structures. Also included within processor device 110a is an L3 cache 114a for sharing by multiple processor cores; processor cores 111a and 112a may access shared L3 cache 114a via intra-socket interconnect 113 a. Processor device 110a may also include a memory controller 115a for controlling the transfer of read and write data between processor device 110a and memory 120 a. In addition, processor device 110a also includes socket-to-socket (S-S) interconnect interface 116a, and S-S interface 116a may receive data on intra-socket interconnect 113a, send and receive data to and from other sockets or nodes via inter-socket interconnects, and vice versa. In addition to these illustrated blocks, the processor 110a may include many other functional blocks that are not shown for clarity, such as a PCIe interface, a network interface controller NIC, and so forth.
In fig. 1, each processor device may be operatively coupled to a printed circuit board, referred to as a motherboard, via a socket, or otherwise coupled to the motherboard via a direct coupling technique (e.g., flip-chip bonding). The motherboard includes electrical wiring (e.g., traces and vias) to provide the electrical connections corresponding to the physical structure of the various interconnections depicted in fig. 1. These interconnects may include a socket-to-socket (S-S) interconnect 102 between socket-to-socket (S-S) interconnect interfaces 116a and 116b, which is used to communicatively couple multiple processor nodes. In one embodiment, the inter-socket S-S interconnect 102 may employ or support QPI or Infinity Fabric, among other protocols and wiring structures.
As shown in fig. 1, the L1 cache, L2 cache, L3 cache 114a, and memory 120a form a storage hierarchy ordered by distance from the core, from near to far. Generally, the closer a level is to the core, the faster the core can access it and the lower the latency. For example, the L1 and L2 caches may be relatively high-cost static random access memory (SRAM), with a core's L1 access latency of about 1-4 clock cycles and L2 access latency of about 12 clock cycles. The L3 cache may be relatively low-cost dynamic random access memory (DRAM) or embedded DRAM (eDRAM) or the like, with an access latency of approximately 36 clock cycles. Memory 120a may typically be DRAM or the like, with an access latency of about 60-100 clock cycles.
Due to data locality, including spatial and temporal locality, a processor (such as a core of fig. 1) often needs to read the same data multiple times within a short period. Given that the operating frequency of memory is far lower than the speed of the processor, with this storage hierarchy the processor queries L1 through L3 in turn for the needed data, so that it can read the data from one of the cache levels instead of from memory (a hit), greatly reducing data access time.
As shown in FIG. 1, L3 cache 114a holds cache lines (L1 and L2 are similar), and when all of L1-L3 miss, the needed data is loaded from memory 120a. Specifically, the memory line (or memory block) containing the desired data is loaded and held in the L3 cache in the form of a cache line for subsequent re-access. Similarly, when data is loaded from the L3 cache for processor access, it is further promoted to the L2 cache, and so on.
Referring to table 1, a schematic structure of a cache line of an embodiment of the present disclosure is shown.
Tag | Data block | Flag bits
TABLE 1
A cache line may include a tag, which contains a portion of a memory address; a data block, which is the content of the cache line, i.e., the valid data portion; and flag bits, which generally indicate whether the data is modified (Modified), exclusive (Exclusive), shared (Shared), or invalid (Invalid), and are used to control cache coherency.
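For illustration only, the Table 1 layout can be pictured as the minimal C++ sketch below; the type and field names (CacheLine, LineState, and so on) are assumptions made for this sketch and do not come from the patent.

```cpp
#include <array>
#include <cstdint>

// Minimal sketch of the Table 1 cache line layout (names are illustrative).
enum class LineState : uint8_t { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    uint32_t tag = 0;                       // tag: portion of the memory address
    std::array<uint8_t, 64> data{};         // data block: the valid data portion
    LineState state = LineState::Invalid;   // flag bits used for cache coherency
};
```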
The bus processing unit 117a of the core may determine whether the cache is hit according to the memory address of the required data. Table 2 shows an exemplary structure of an effective memory address according to an embodiment of the present disclosure.
Tag | Index | Intra-block offset
TABLE 2
The effective memory address may include a tag, an index, and an intra-block offset. The index indicates the cache set into which the data block is loaded; the index combined with the tag is used to search for a cache line in each level of cache and determine whether it hits. The intra-block offset indicates the offset, within the cache line, of the data required by the processor.
For example, suppose the L1 cache size is 8 KB, each cache line is 64 bytes, and 4 cache lines make up a cache set. Then there are 8 KB / 64 B = 128 cache lines in total, and with 4 lines per set, 128 / 4 = 32 cache sets. Therefore, for a 32-bit address, the intra-block offset is 6 bits (2^6 = 64), the index is 5 bits (2^5 = 32), and the tag is 21 bits (32 - 5 - 6). The above briefly describes the mapping between memory addresses and cache lines and the mechanism for looking up a cache line in a cache from a memory address.
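As a worked illustration of this mapping, the following sketch splits a 32-bit address for the 8 KB, 64-byte-line, 4-way example above; the constant names are invented for the sketch.

```cpp
#include <cstdint>
#include <cstdio>

// Address split for the 8 KB, 64-byte-line, 4-way example (32-bit addresses).
constexpr unsigned kLineSize   = 64;                            // bytes per cache line
constexpr unsigned kNumLines   = 8 * 1024 / kLineSize;          // 128 lines
constexpr unsigned kNumSets    = kNumLines / 4;                 // 32 sets (4-way)
constexpr unsigned kOffsetBits = 6;                             // 2^6 = 64
constexpr unsigned kIndexBits  = 5;                             // 2^5 = 32
constexpr unsigned kTagBits    = 32 - kIndexBits - kOffsetBits; // 21

int main() {
    uint32_t addr   = 0x1234ABCDu;                             // example address
    uint32_t offset = addr & (kLineSize - 1);                  // offset within the line
    uint32_t index  = (addr >> kOffsetBits) & (kNumSets - 1);  // cache set to probe
    uint32_t tag    = addr >> (kOffsetBits + kIndexBits);      // compared against stored tags
    std::printf("tag=0x%x index=%u offset=%u (tag width=%u bits)\n",
                tag, index, offset, kTagBits);
    return 0;
}
```

Running it prints the tag, index, and offset extracted from the example address, matching the 21/5/6-bit split derived above.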
Returning to FIG. 1, there are multiple nodes under a NUMA architecture. In this case, when, for example, processor core 112a does not find a desired cache line in any of its local caches L1-L3, its bus processing component 117a may broadcast a snoop signal (SNOOPING) on the interconnect bus, querying the other nodes as to whether their L3 caches store the desired data. The purpose is, on the one hand, to save the additional latency associated with memory accesses and, more importantly, to ensure that the latest data is read, since the same memory data may have been cached and modified in the L3 of one or more other nodes. As shown in fig. 1, the snoop signal is broadcast on S-S interconnect bus 102 via intra-socket interconnect bus 113a and S-S interface 116a, and response information from other nodes is received. This operation may be implemented by the bus processing component 117a. Bus processing component 117a is shown in fig. 1 as a separate component within node 101a; alternatively, bus processing component 117a may be implemented in software or as logic configured within the core of processor core 112a, and each processor core may have a bus processing component to perform similar operations.
Figure 1 shows snoop signals being sent and received between node 101a and node 101b; however, when there are more nodes, the snoop signals are broadcast to every node on the interconnect bus 102. Although snoop communication between bus processing components 117a and 117b is shown as a separate line in fig. 1, this is for clarity only; it should be understood that the communication actually takes place over S-S interconnect 102. For example, when node 101b receives a snoop signal and finds the requested line (valid) in its L3 cache, node 101b sends the most recent copy of the line to node 101a via interconnect bus 102. When no valid response is received after the snoop signal is issued, the requested data is accessed from memory. When data is requested from memory, the latency differs depending on whether the memory address is local or remote to the node, as described in more detail below.
Under a NUMA architecture, each processor (and processor core) can access memory resources distributed across the various nodes. In other words, all memory is shared by all processors and together forms a unified memory address space. A memory resource may be a local memory resource (e.g., on the same node as a processor or core) or a remote memory resource (e.g., on another node). Generally, from the perspective of node 101a, memory 120a is a local memory resource and memory 120b is a remote memory resource. Since local memory resources are operatively coupled to the processor of a given node, access to local memory resources differs from access to remote memory resources (i.e., access is non-uniform). For example, the latency of processor 110a accessing local memory resource 120a is about 80 ns, while the latency of accessing a remote memory resource can be 250 ns or more. Even when data can be obtained from the L3 cache of node 101b without accessing remote memory 120b, the cross-node data transfer may still introduce a delay of about 100 ns. Therefore, a local-memory-first policy is usually adopted during actual memory allocation. Unfortunately, in some applications with large memory usage (e.g., database lookup tables), the processor of each node inevitably has to access remote memory frequently, which greatly reduces server performance.
FIG. 2 shows a schematic flow diagram of a method of memory access in a NUMA architecture in accordance with an embodiment of the present disclosure. This is explained with reference to fig. 1.
At step 210, the processor core issues a memory access request. For example, machine instructions obtained by processor core 112a include a read or write to a memory address, creating a data access need for the memory address.
Then, at step 220, the L1 and L2 caches associated with the processor core are looked up in turn for a cache line containing the required data. Specifically, a cache line is searched for in L1 and L2 according to the memory address information. In one embodiment, each cache line may include 64 bytes of data; when the required data is contained in a cache line (and the corresponding cache line is "valid"), L1 or L2 hits. In this case, the corresponding cache line is promoted to a closer cache (e.g., from L2 to L1) for use by the processor core, and the flow ends. Otherwise, proceed to step 230.
At step 230, the local L3 cache is queried for the required data. For example, a lookup in L3 114a is made to see if there is data needed (the corresponding cache line should be "valid"). If so, L3 hits, the corresponding cache line is promoted to a closer cache (e.g., L3 to L2 to L1) for use by the processor core, and the process ends. Otherwise, proceed to step 240.
In step 240, in the event that none of L1-L3 hits, a snoop signal is broadcast on the interconnect bus. For example, processor core 112a may utilize bus processing component 117a to broadcast a snoop signal on S-S interconnect bus 102 via interconnect bus 113a and interface 116a so that bus processing components of other nodes may snoop.
Next, at step 250, the other nodes look up the needed data in their L3 caches. If found (the corresponding cache line should be "valid"), then an L3 hit occurs. In this case, a copy of the corresponding node cache line is sent onto the S-S bus for receipt by the node requesting the data. For example, a processor core in node 101b may send a cache line in its L3 cache 114b to node 101a via S-S bus 102. Accordingly, the cache line is stored in L3 cache 114a at node 101a and promoted to a closer cache (e.g., L3 to L2 to L1) for use by the processor core, and the process ends. If there are no hits at other nodes, proceed to step 260.
In step 260, in the event that none of the caches hit, the memory access request is submitted to the memory controller at the node that owns the required data, the data is accessed from memory, and the process ends. If it is a local memory access, the required data is cached in the local L3; if it is a remote memory access, the required data is cached in the remote L3 and then transmitted to the L3 of the requesting node via the S-S bus.
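Steps 210 through 260 can be summarized in a small, self-contained toy model such as the one below; the maps simply stand in for the L1/L2, local L3, remote L3, and memory levels, and all names are assumptions rather than hardware interfaces.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of the FIG. 2 flow. Each map stands in for one storage level.
using Addr = uint64_t;
using Line = uint64_t;  // stand-in for a 64-byte data block

std::unordered_map<Addr, Line> l1l2, localL3, remoteL3, memory;

Line readData(Addr a) {
    if (auto it = l1l2.find(a); it != l1l2.end()) return it->second;   // steps 210-220: L1/L2 hit
    if (auto it = localL3.find(a); it != localL3.end()) {              // step 230: local L3 hit
        l1l2[a] = it->second;                  // promote to the closer caches
        return it->second;
    }
    // Step 240: broadcast a snoop; modelled here as probing the remote L3 map.
    if (auto it = remoteL3.find(a); it != remoteL3.end()) {            // step 250: remote L3 hit
        localL3[a] = l1l2[a] = it->second;     // copy travels over the S-S bus
        return it->second;
    }
    Line v = memory[a];                                                // step 260: memory access
    localL3[a] = l1l2[a] = v;
    return v;
}

int main() {
    memory[0x1000] = 42;
    return readData(0x1000) == 42 ? 0 : 1;
}
```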
As described above, accessing remote memory incurs large delays from the memory access itself and from cross-node data transmission, which degrades performance; this is especially pronounced in large-memory application scenarios. To mitigate this problem, embodiments of the present disclosure provide a remote cache local image (RCLS) scheme, which improves data access efficiency by storing an image of remote caches in a larger-capacity local backing store. In one embodiment, the image of a remote node's L3 cache may be stored in a local L4 cache, or last-level cache (LLC). In another possible embodiment, if L3 is already the last-level cache (LLC), the image of the remote node's L3 cache may be stored in local memory.
Fig. 3 shows a schematic diagram of a computer device 300 of another NUMA architecture in accordance with an embodiment of the present disclosure. In contrast to fig. 1, in the computer device 300 of fig. 3 the cache lines in the L3 caches of remote nodes are additionally stored in each node's backing store below the L3 cache, as indicated by reference numerals 305 and 330a. It should be noted that although fig. 3 shows two nodes 301a and 301b, those skilled in the art will appreciate that computer device 300 may include many more nodes, and that each node with the remote cache local image (RCLS) function enabled may store cache lines from any other node, as will be described in detail below.
According to embodiments of the present disclosure, computer device 300 may be configured to enable the remote cache local image RCLS, in which case each node in computer device 300 may receive cache lines from the L3 caches of the other nodes via interconnect bus 302. In other words, in addition to being cached in each node's L3, the memory data is mirrored (further cached) in a lower level of storage at each node. Generally, the storage below L3 has a much larger capacity than L3, so each node has enough storage capacity to mirror the L3 caches of other nodes without impacting local usage.
For example, a computer device of an exemplary NUMA architecture may include 64 nodes, each of which may have, for example, 16 MB of L3 capacity and 16 GB of local memory. With the RCLS function enabled, even if all the cache lines of the remote L3 caches are mirrored into each node's local memory, only 64 x 16 MB / 16 GB = 6.4% of the memory capacity is occupied. In actual operation, however, the RCLS function need not be enabled on all nodes, but only on a portion of the nodes, and each RCLS may also mirror only the remotely cached lines it needs rather than all of them. In fact, the scale of RCLS enablement (the number of NUMA nodes on which it is enabled) can be flexibly adjusted according to the cache capacity of each node and the capacity of the next-level storage, so as to achieve better performance of the computer device.
Referring to FIG. 3, arrow 305 represents mirroring of the L3 cache of node 301b to node 301a locally to form the remote cache local image RCLS 330a. This RCLS 330a may be stored in any storage below the local L3 cache: for example, when node 301a has L4 cache 318a, RCLS 330a is stored in the L4 cache; when node 301a does not have an L4 cache, RCLS 330a may be stored in an allocated memory area. Similarly, as shown, the memory 320b and the L4 cache 318b of node 301b store the RCLS of the other node's L3 cache. It should be noted that although FIG. 3 only shows cache lines of the L3 cache of node 301b being stored at node 301a, it is to be understood that cache lines of the L3 caches of any other nodes may be stored on every NUMA node that has RCLS enabled.
According to the embodiments of the present disclosure, the cache lines of each node's L3 cache may be transmitted to other nodes synchronously or asynchronously; that is, the RCLS may be updated immediately or with a delay. For example, in one embodiment, when memory controller 316b of node 301b reads data from memory 320b and places it into L3 cache 314b, the data is synchronized onto the interconnect bus and cached via the S-S bus into L4 cache 318a or memory 320a to update the RCLS at node 301a. In one possible embodiment, the update of RCLS 330a at node 301a may be initiated proactively by memory controller 316b of node 301b, thereby enabling real-time updates of the RCLS. Alternatively, the update of RCLS 330a at node 301a may be triggered with a delay, for example when the processor cores of node 301b are not reading or writing L3, or on a timed trigger. Alternatively, the RCLS update may also be triggered by snooping L3 cache update messages on the interconnect bus; for example, bus processing component 317a of node 301a may snoop all L3 cache updates on the interconnect bus, with node 301a deciding by itself whether to update its local RCLS from the L3 cache of node 301b (and other nodes).
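The three update triggers just discussed (immediate push, deferred or timed update, and snoop-driven update) could be expressed as a small policy check like the following sketch; the enum and parameter names are assumptions for illustration only.

```cpp
// Illustrative policy check for when a node refreshes its local RCLS copy of a
// remote L3 cache line; the names are assumptions, not taken from the patent.
enum class RclsUpdateTrigger { Immediate, Deferred, SnoopDriven };

bool shouldRefreshRcls(RclsUpdateTrigger trigger,
                       bool remoteL3Idle,        // remote cores not reading/writing their L3
                       bool timerExpired,        // periodic trigger fired
                       bool sawL3UpdateOnBus) {  // local bus unit snooped an L3 update message
    switch (trigger) {
        case RclsUpdateTrigger::Immediate:   return true;                        // real-time update
        case RclsUpdateTrigger::Deferred:    return remoteL3Idle || timerExpired; // delayed update
        case RclsUpdateTrigger::SnoopDriven: return sawL3UpdateOnBus;             // node decides itself
    }
    return false;
}
```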
As described above, in addition to being cached in each node's L3, the data in memory is cached in the remote cache local image of each node, and the data is cached in the form of cache lines. Embodiments of the present disclosure also provide a cache line structure that meets cache coherency requirements while reducing cross-node data transmission. This cache line structure extends the structure shown in Table 1.
Tag | Data block | Flag bits | WH
TABLE 3
Referring to Table 3, according to an embodiment of the present disclosure, a write history field (Write History, WH) is added to the flag bits of the cache line to record the number of times the data has been effectively modified (written). In one embodiment, the WH field consists of multiple bits; it will be understood that the more WH bits there are, the more cache line writes can be recorded.
In a conventional cache coherency protocol, whenever a cache line is written, write information must be broadcast on the system bus so that other nodes can sense it, modify the flag bits of their own copies of the cache line, and update the data block. In this case, the data traffic required to maintain cache coherency in a NUMA architecture occupies at least about 15-20% of the interconnect bus. In contrast, the embodiments of the present disclosure introduce the write history flag WH into the flag bits and combine it with the remote cache local image RCLS, so that the data traffic used to maintain cache coherency can be greatly reduced and bus bandwidth resources are saved.
According to the embodiments of the present disclosure, when read/write activity on a cache line is low, for example when the processor core has not read or written the L3 cache for about 10 milliseconds (a relatively long time compared to the speed of the CPU), the cache lines of the L3 cache may be packed and sent onto the interconnect bus to be synchronized to the RCLS of other nodes. In other words, the RCLS may be updated with a delay, which ensures that the processor is not blocked while data is being read and written frequently. On the other hand, because the WH flag bit records the write-count information, the processor core does not need to broadcast write information on the interconnect bus on every cache write; in other words, the processor core may accumulate modifications of multiple cached data blocks before resynchronizing to the RCLS of other nodes.
According to an embodiment of the present disclosure, whenever a processor or processor core writes to the L3 cache, the write history field WH of the corresponding cache line is incremented by 1. For example, in the case where WH has 4 bits (the disclosure is not limited thereto), the cache line should be synchronized to the RCLS of the other nodes when WH reaches 15 (0xF), before WH overflows. As described above, when there are few cache line read and write operations, the cache lines are packed and transmitted onto the interconnect bus for synchronization to the other RCLS. After the L3 cache is synchronized to the RCLS of the other nodes, the WH bits of all cache lines in the L3 cache and in the RCLS are cleared or modified to be consistent between the L3 cache and the RCLS. In other words, the WH bits can be used to indicate the coherency of the L3 cache with the RCLS.
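A minimal sketch of the WH behaviour described above follows, assuming a 4-bit counter that forces synchronization at 15 (0xF); the structure and function names are assumptions, not part of the patent.

```cpp
#include <cstdint>

// Sketch of the 4-bit write-history (WH) counter behaviour (names assumed).
struct WhLine {
    uint8_t wh = 0;      // write history: effective-write count, 4 bits in this example
    bool dirty = false;  // line has accumulated unsynchronized writes
};

// Called on every write the core makes to this line in its L3 cache.
bool onL3Write(WhLine& line) {
    line.dirty = true;
    line.wh = static_cast<uint8_t>((line.wh + 1) & 0xF);  // increment within the 4-bit field
    // Synchronize to the other nodes' RCLS when WH reaches 15, before it overflows.
    return line.wh == 0xF;  // true => pack the line and send it on the interconnect bus
}

// After the line has been synchronized to the RCLS of the other nodes, WH is
// cleared (or set to an agreed value) in both the L3 copy and the RCLS copy.
void onRclsSync(WhLine& l3Copy, WhLine& rclsCopy) {
    l3Copy.wh = rclsCopy.wh = 0;
    l3Copy.dirty = false;
}
```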
A memory access method of a computer device according to an embodiment of the present disclosure will be described below with reference to fig. 4.
FIG. 4 is a schematic flow chart illustrating another memory access method 400 for a NUMA architecture according to an embodiment of the present disclosure. Here, assume that the method is performed in a computer device having 4 NUMA nodes, labeled NODE_0, NODE_1, NODE_2, and NODE_3. For example, each node includes 32 processor cores, so the 128 cores are denoted CPU 0 through CPU 127, and each node has a respective L3 cache, denoted L3_0, L3_1, L3_2, and L3_3. A node may have an L4 cache, in which case the remote cache local image (RCLS) is stored in the L4 cache; if a node does not have an L4 cache, the remote cache local image is stored in the node's memory. The RCLS of the nodes are denoted RCLS_0, RCLS_1, RCLS_2, and RCLS_3, respectively.
Taking CPU 0 as an example, the method 400 begins when a memory read/write request generated by CPU 0 misses in its local L1-L3.
In step 410, the bus processing component of CPU 0 broadcasts a data request message over the interconnect bus requesting the cache line that includes the desired data. The data request message can be sensed by the bus processing components of the other processor cores, which then look up the data needed by CPU 0 in their L3 caches.
Then, at step 420, if there is a remote cache hit, for example if the L3_3 cache of CPU 127 (at NODE_3) has the latest version of the data needed by CPU 0, CPU 127 may synchronize the corresponding cache line from its L1 and L2 caches to the L3_3 cache and write-lock the corresponding cache line in L3_3 so that the cache line cannot be overwritten until the present read/write operation of CPU 0 is complete, and the method proceeds to step 430. If no remote cache hits, the method flow ends and CPU 0 obtains the data from memory.
At step 430, a response message issued by CPU 127 on the interconnect bus is received; it may include the write history information WH of the requested cache line. In one embodiment, to improve bus utilization efficiency, the response message also includes a portion of the data block contents of the cache line. For example, the response message may be 16 bytes; WH occupies only a part (a few bits) of the response message, and the rest may be filled with part of the cache line's data, so as to reduce the amount of data to be transmitted subsequently.
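For illustration, the 16-byte response message of step 430 might be laid out as follows; the exact field widths are assumptions, since the text only states that WH occupies a few bits and that the remainder may carry part of the data block.

```cpp
#include <cstdint>

// Assumed layout of the 16-byte snoop response carrying WH plus partial data.
struct SnoopResponse {
    uint8_t writeHistory;   // WH value of the hit cache line (only a few bits used)
    uint8_t flags;          // e.g., "line found and valid" indication (assumed field)
    uint8_t payload[14];    // first bytes of the cache line's data block
};
static_assert(sizeof(SnoopResponse) == 16, "matches the 16-byte example above");
```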
Then, in step 440, CPU 0 compares the WH in the response message with the WH in RCLS_0. If they match, CPU 0 fetches the cache line from the local RCLS_0 and loads it into its local L3_0 at step 450, and sends a data transfer complete message to NODE_3 at step 470, whereupon CPU 127 modifies the flag bits of the corresponding cache line in L3_3.
If the WH in RCLS_0 and the WH in the response message from CPU 127 are not consistent, i.e., the cache line in L3_3 at the node of CPU 127 is newer than the copy in RCLS_0, then at step 460 CPU 0 fetches the cache line from the node of CPU 127, stores it in L3_0, promotes it to L2, L1, and the registers of CPU 0 for use, and synchronizes the cache line in RCLS_0 with the cache line provided by CPU 127. At this point, the WH is cleared or modified to be consistent. In addition, upon completion of the transfer from L3_3 to L3_0, it may also be determined whether a write to memory is required, for example depending on whether a write-back (write back) or write-through (write through) mode is used.
In step 470, CPU 0 sends a data transfer complete message to CPU 127. If the access request of CPU 0 is a read operation, CPU 127, upon receiving the transfer complete message, may for example modify the flag bits of the corresponding cache line in L3_3 from Exclusive to Shared, and may release the write lock on the cache line in L3_3. If the access request of CPU 0 is a write operation, CPU 127 may modify the flag bits to Invalid; in fact, in the case of a write operation, an invalidation signal is placed on the bus to be sensed by the other nodes according to the cache coherency protocol, so that, apart from the data in L3_0, the flag bits of the cache line copies in the other nodes' RCLS are invalidated, and therefore there is no need to update the RCLS at the other nodes.
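Putting steps 410 to 470 together, the decision made on the requesting node could look like the hedged sketch below. Every type and helper here (broadcastRequestAndWait, rclsLookup, fetchFromRemoteL3, and so on) is a placeholder declared only to show the control flow; none of them is an API defined by the patent.

```cpp
#include <cstdint>
#include <optional>

// Hedged sketch of the step 410-470 decision on the requesting node (CPU 0).
struct Line { uint8_t wh; uint64_t data[8]; };   // WH plus a 64-byte data block
struct SnoopReply { uint8_t wh; bool found; };   // condensed view of the response message

// Placeholder declarations standing in for bus/cache machinery (assumed, not real APIs).
SnoopReply broadcastRequestAndWait(uint64_t addr);       // steps 410-430
std::optional<Line> rclsLookup(uint64_t addr);           // local RCLS_0
Line fetchFromRemoteL3(uint64_t addr);                   // cross-node transfer over the S-S bus
void updateRcls(uint64_t addr, const Line& l);
void installInLocalL3(uint64_t addr, const Line& l);
void sendTransferComplete(uint64_t addr, bool isWrite);  // step 470

void handleMiss(uint64_t addr, bool isWrite) {
    SnoopReply r = broadcastRequestAndWait(addr);
    if (!r.found) { /* no remote L3 hit: fall back to a memory access */ return; }

    auto local = rclsLookup(addr);
    if (local && local->wh == r.wh) {
        installInLocalL3(addr, *local);       // step 450: WH matches, use the local RCLS copy
    } else {
        Line fresh = fetchFromRemoteL3(addr); // step 460: remote copy is newer
        installInLocalL3(addr, fresh);
        updateRcls(addr, fresh);              // resynchronize RCLS_0; WH made consistent
    }
    sendTransferComplete(addr, isWrite);      // step 470: remote flag -> Shared (read) or Invalid (write)
}
```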
Although example embodiments have been described, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the inventive concept. Accordingly, it should be understood that the above example embodiments are not limiting, but illustrative.

Claims (12)

1. A memory access method of a computer device, the computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache, and a memory, the method comprising:
a node of the plurality of nodes locally storing cache lines from the caches of other nodes to form a remote cache local image RCLS; and
a processor core of the node accesses the cache line from the RCLS;
wherein the node comprises a plurality of processor cores, the cache being an L3 cache shared by the plurality of processor cores; the RCLS is stored in memory or L4 cache of the node.
2. The method of claim 1, wherein the cache line of the RCLS is updated periodically or when the cache line in the other node's cache is infrequently read and written.
3. The method of claim 1, wherein the cache line includes write history information that records a number of times the cache line was written in an L3 cache and is cleared or modified to be consistent in the L3 cache and the RCLS when transferred to the RCLS.
4. The method of claim 1, further comprising:
a processor core of the node broadcasting a cache line request message on the interconnect bus;
a processor core of the node receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line;
the processor core of the node compares the write history information of the cache line in the RCLS with the write history information of the response message; and
based on the comparison, the processor core of the node obtains the cache line from the RCLS or obtains the cache line from the other node.
5. The method of claim 4, wherein,
when the comparison indicates that the cache lines of the other nodes are the same as the cache lines of the RCLS, acquiring the cache lines from the RCLS; and
when the comparison indicates that the cacheline of the other node is different from the cacheline of the RCLS, the cacheline is obtained from the other node and the RCLS is updated.
6. The method of claim 1, after a processor core of the node fetches the cache line, the method further comprising
When the access is a read operation, modifying the flag bit of the cache line in the cache of the other node to be shared; and
when the access is a write operation, modifying the flag bit of the cache line in the cache of the other node to be invalid.
7. A computer device comprising a plurality of nodes connected via an interconnection bus, each node comprising an integrated processor core and cache memory, and a memory, wherein
A node of the plurality of nodes stores a remote cache local image (RCLS) locally, wherein the RCLS is used for storing cache lines in caches of other nodes and providing the cache lines to processor cores of the node;
wherein the node comprises a plurality of processor cores, the cache being an L3 cache shared by the plurality of processor cores; the RCLS is stored in the memory or L4 cache of the node.
8. The computer device of claim 7, wherein the cache line of the RCLS is updated periodically or when the cache line in the other node's cache is infrequently read and written.
9. The computer device of claim 7, wherein the cacheline comprises write history information that records a number of times the cacheline was written in an L3 cache and is cleared or modified to be consistent in the L3 cache and the RCLS when transferred to the RCLS.
10. The computer device of claim 7, wherein a processor core of the node is configured for
Broadcasting a cache line request message on the interconnect bus;
receiving a response message of the other node, the response message indicating that the other node has the cache line in its cache, the response message including write history information of the cache line;
comparing the write history information of the cache line in the RCLS with the write history information of the response message; and
obtaining the cacheline from the RCLS or obtaining the cacheline from the other node according to the comparison.
11. The computer device of claim 10, wherein the processor core of the node is further configured for
When the comparison indicates that the cacheline of the other node is the same as the cacheline of the RCLS, obtaining the cacheline from the RCLS; and
when the comparison indicates that the cacheline of the other node is different from the cacheline of the RCLS, the cacheline is obtained from the other node and the RCLS is updated.
12. The computer device of claim 7, wherein the processor core of the node is further configured to, after fetching the cache line,
when the memory access is a read operation, modifying the flag bit of the cache line in the caches of the other nodes to be shared; and
when the memory access is a write operation, modifying the flag bit of the cache line in the caches of the other nodes to be invalid.
CN201911394845.2A 2019-12-30 2019-12-30 Memory access method of computer equipment and computer equipment Active CN111143244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer equipment and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911394845.2A CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer equipment and computer equipment

Publications (2)

Publication Number Publication Date
CN111143244A CN111143244A (en) 2020-05-12
CN111143244B (en) 2022-11-15

Family

ID=70521927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394845.2A Active CN111143244B (en) 2019-12-30 2019-12-30 Memory access method of computer equipment and computer equipment

Country Status (1)

Country Link
CN (1) CN111143244B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112468416B (en) * 2020-10-23 2022-08-30 曙光网络科技有限公司 Network flow mirroring method and device, computer equipment and storage medium
CN112416259B (en) * 2020-12-04 2022-09-13 海光信息技术股份有限公司 Data access method and data access device
CN112612727B (en) * 2020-12-08 2023-07-07 成都海光微电子技术有限公司 Cache line replacement method and device and electronic equipment
WO2022155820A1 (en) * 2021-01-20 2022-07-28 Alibaba Group Holding Limited Core-aware caching systems and methods for multicore processors
CN113722110B (en) * 2021-11-02 2022-04-15 阿里云计算有限公司 Computer system, memory access method and device
CN116955270B (en) * 2023-09-21 2023-12-05 北京数渡信息科技有限公司 Method for realizing passive data caching of multi-die interconnection system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing Efficient Compression cache line
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711652B2 (en) * 2001-06-21 2004-03-23 International Business Machines Corporation Non-uniform memory access (NUMA) data processing system that provides precise notification of remote deallocation of modified data
US8918587B2 (en) * 2012-06-13 2014-12-23 International Business Machines Corporation Multilevel cache hierarchy for finding a cache line on a remote node
US9436838B2 (en) * 2012-12-20 2016-09-06 Intel Corporation Secure local web application data manager
CN104917784B (en) * 2014-03-10 2018-06-05 华为技术有限公司 A kind of data migration method, device and computer system
US10430349B2 (en) * 2016-06-13 2019-10-01 Advanced Micro Devices, Inc. Scaled set dueling for cache replacement policies
CN106250350A (en) * 2016-07-28 2016-12-21 浪潮(北京)电子信息产业有限公司 A kind of caching of page read method based on NUMA architecture and system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 Cache management method and cache management device for data reading by clients
CN109952565B (en) * 2016-11-16 2021-10-22 华为技术有限公司 Memory access techniques
CN107623722A (en) * 2017-08-21 2018-01-23 云宏信息科技股份有限公司 A kind of remote data caching method, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477496A (en) * 2008-12-29 2009-07-08 北京航空航天大学 NUMA structure implementing method based on distributed internal memory virtualization
CN103744799A (en) * 2013-12-26 2014-04-23 华为技术有限公司 Memory data access method, device and system
CN107025130A (en) * 2016-01-29 2017-08-08 华为技术有限公司 Handle node, computer system and transactional conflict detection method
EP3486785A1 (en) * 2017-11-20 2019-05-22 Samsung Electronics Co., Ltd. Systems and methods for efficient cacheline handling based on predictions
CN109815165A (en) * 2017-11-20 2019-05-28 三星电子株式会社 System and method for storing and processing Efficient Compression cache line
CN110321299A (en) * 2018-03-29 2019-10-11 英特尔公司 For detecting repeated data access and automatically loading data into system, method and apparatus in local cache

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor; Pat Conway et al.; IEEE; 2010-04-12; pp. 16-29 *

Also Published As

Publication number Publication date
CN111143244A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111143244B (en) Memory access method of computer equipment and computer equipment
JP5679969B2 (en) Snoop filtering mechanism
US7814286B2 (en) Method and apparatus for filtering memory write snoop activity in a distributed shared memory computer
US6289420B1 (en) System and method for increasing the snoop bandwidth to cache tags in a multiport cache memory subsystem
KR100970229B1 (en) Computer system with processor cache that stores remote cache presence information
US10402327B2 (en) Network-aware cache coherence protocol enhancement
US20080109624A1 (en) Multiprocessor system with private memory sections
JP2000298659A (en) Complete and concise remote(ccr) directory
US7149852B2 (en) System and method for blocking data responses
US6721852B2 (en) Computer system employing multiple board sets and coherence schemes
JP2000067024A (en) Divided non-dense directory for distributed shared memory multi-processor system
US7174430B1 (en) Bandwidth reduction technique using cache-to-cache transfer prediction in a snooping-based cache-coherent cluster of multiprocessing nodes
GB2610015A (en) Cache for storing coherent and non-coherent data
US8489822B2 (en) Providing a directory cache for peripheral devices
CN114238171B (en) Electronic equipment, data processing method and device and computer system
US20190220408A1 (en) Remote node broadcast of requests in a multinode data processing system
US11954033B1 (en) Page rinsing scheme to keep a directory page in an exclusive state in a single complex
US11755483B1 (en) System and methods for reducing global coherence unit snoop filter lookup via local memories
US20240134795A1 (en) Page rinsing scheme to keep a directory page in an exclusive state in a single complex
CN113687955B (en) Digital circuit design method for efficiently processing cache consistency between GPU (graphics processing Unit) chips
US6636948B2 (en) Method and system for a processor to gain assured ownership of an up-to-date copy of data
KR100258358B1 (en) The method and apparatus for data coherency of distributed shared memory
CN112612726A (en) Data storage method and device based on cache consistency, processing chip and server
US20050289302A1 (en) Multiple processor cache intervention associated with a shared memory unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant after: Haiguang Information Technology Co.,Ltd.

Address before: Industrial incubation-3-8, North 2-204, No. 18, Haitai West Road, Tianjin Huayuan Industrial Zone, Binhai New Area, Tianjin

Applicant before: HAIGUANG INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant