WO2022246769A1

WO2022246769A1 - Data access method and apparatus

Info

Publication number: WO2022246769A1
Application number: PCT/CN2021/096550
Authority: WO
Inventors: 黎卓南; 苏勇; 韩立虎
Original assignee: 华为技术有限公司
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-12-01
Also published as: CN116685958A

Abstract

The embodiments of the present application relate to the technical field of chips. Provided are a data access method and apparatus, which can reduce delay in the completion of atomic computation by a CPU, and reduce a collision rate of atomic operations and system overheads due to lock contending in a multi-core CPU, thereby improving the throughput of atomic operations. The method comprises: a first node receiving a plurality of first read requests sent by a plurality of cache nodes of a first cache layer, wherein the plurality of first read requests are each used for requesting a computation permission of a first address, and the first node is used for managing the coherence of the plurality of cache nodes; according to the order of first read requests respectively first sent by the plurality of cache nodes, determining the order in which the plurality of cache nodes acquire the computation permission; and when the computation permission is acquired, according to the order in which the plurality of cache nodes acquire the computation permission, controlling the computation permission to be transferred between the plurality of cache nodes. The embodiments of the present application are used for executing a data read-modify-write operation in a cache on the basis of cache coherence.

Description

A method and device for accessing data

technical field

The embodiments of the present application relate to the field of chip technology, and in particular, to a method and device for accessing data.

Background technique

Modern computer systems and multi-core chips support shared memory (shared memory) in hardware, that is, the shared memory can be accessed by multiple central processing units (Central Processing Unit, CPU) as a medium for sharing and transferring data between software processes , can improve the communication efficiency between processes. In order to ensure that after multiple CPUs read, modify and write the same memory address of the shared memory, the software can finally get the correct execution result, and various memory consistency models (memory consistency models) are proposed, that is, multiple CPUs need to follow certain access rules. Sequence rules to read and rewrite shared memory can obtain correct execution results, otherwise, the correctness of execution results is not guaranteed. Due to the different read and write rules defined by different storage consistency models, the CPU will execute instructions without dependencies in order to achieve more performance, and multiple threads are also allowed to interleave to improve throughput. In order to ensure multi-thread interleaved operation The execution sequence of read, modify and write operations in the process proposes a synchronization mechanism (synchronization). In the synchronization mechanism, the atomic access behavior of reading, modifying and writing operations on a shared variable is completed through atomic operations, and the atomic access behavior for implementing a series of instructions is achieved by using locks and critical sections. section) is completed. Whether it is an atomic operation or a lock and a critical section, in the underlying logic of the hardware, the read and write operation of the shared variable is completed through the atomic instruction (atomic instruction).

At present, computer systems and multi-core chips will add cache memory (cache), also known as cache, in the memory (memory) level to reduce the time delay of accessing memory. Usually, a multi-level cache is introduced into the system, that is, cache hierarchy or multi-level cache. In order to ensure the correctness of the data read when different CPUs access the same cache line (cache line) address at the same time, multiple CPUs need to support cache coherence when they read, modify and write the same shared variable at the same time . The cache consistency is usually implemented based on a modified-exclusive-shared-invalid (Modified-Exclusive-Shared-Invalid, MESI) coherence protocol. In the MESI consensus protocol, to rewrite a shared variable, the CPU needs to first obtain the exclusive (Exclusive, E) state of the shared variable, that is, to have the permission to rewrite the ownership of the shared variable, that is, where the shared variable is located. Therefore, the CPU also needs to obtain the E state of the shared variable when it completes the read, modify, and write operations of the shared variable through atomic instructions. However, when multiple CPUs read and rewrite the same shared variable at the same time, the E state will be frequently migrated due to contention by multiple CPUs, that is, ownership migration. However, ownership migration will cause a large system overhead (overhead), resulting in poor throughput of multi-core CPU execution of atomic instructions.

Contents of the invention

Embodiments of the present application provide a method and device for accessing data, which can improve the throughput of atomic operations performed by multi-core CPUs in a multi-level cache system architecture.

In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:

In the first aspect, the embodiment of the present application provides a method for accessing data, the method includes: the first node receives multiple first read requests sent by multiple cache nodes in the first cache layer, and the multiple first read requests are all It is used to request the operation authority of the first address; the first node is used to manage the consistency of multiple cache nodes; the first node determines that multiple cache nodes obtain The order of computing permissions: when the first node obtains computing permissions, it controls the transfer of computing permissions between multiple cache nodes according to the order in which multiple cache nodes obtain computing permissions.

Thus, when multiple CPUs need to perform read and write operations on the same address at the same time, the operation authority to rewrite the data in the address will be hierarchical from bottom to top through the first node (the first node receives the first For read requests, the first node can be considered as the next level of the first cache layer.) Scheduling and transfer management is performed to transfer the computing authority between cache nodes, so that near atomic (near atomic) operations can be performed at the internal cache level of the CPU core , to reduce the delay of the CPU to complete the atomic operation. It can also avoid the entry queue congestion caused by atomic instructions queuing in public interleaving nodes in the prior art, reduce the conflict rate and system overhead of atomic operations caused by multi-core CPU lock grabbing, and improve the throughput of atomic operations.

In a possible design, when the first node obtains the computing permission, according to the order of the first first read request sent by the multiple cache nodes respectively, controlling the transfer of the computing permission among the multiple cache nodes includes: the first node When the computing permission is obtained, if the first cache node is the first node to obtain the computing permission among the multiple cache nodes, the first node sends the computing permission to the first cache node; the first node obtains the first cache node from the first cache node. Data, the first data is the operation result of the first address by the first cache node, and sends the first data and operation authority to the second cache node, and the second cache node is the second node that obtains the operation authority among multiple cache nodes .

As a result, when the first cache node obtains the operation authority for the first address, it can process the first read request that the first cache node needs to process without interference from other cache nodes, and other cache nodes do not need Perform a lock grab operation. When the CPU corresponding to the first cache node executes the operation of the first address on the first cache node to obtain the original result, it can send the operation result and the operation authority of the first address to the second cache node, so that the second The CPU of the second cache node may perform operations on the first address. In this way, it can be guaranteed that the data acquired by each cache node is the latest operation result of the operation on the first address, which is used to ensure cache consistency.

In a possible design, obtaining the first data from the first cache node by the first node includes: when the first node determines that the time when the first cache node obtains the computing authority reaches the first time period, obtains the first data from the first cache node. Data; wherein, the first time period is determined by the first node according to the level at which the first node is located.

Therefore, setting the first time period can ensure that each cache node can process multiple first read requests by using the computing permission after obtaining the computing permission, avoiding the frequent migration caused by the competition for computing permission, and at the same time ensuring multiple The fairness of operations performed by each cache node on the first address.

In a possible design, the first cache layer is a first-level L1 cache layer, and multiple cache nodes correspond to the dedicated caches of multiple CPUs; the first node is a node of the second-level L2 cache layer, and the first The node is a shared cache of multiple CPUs corresponding to multiple cache nodes; the multiple cache nodes and the first node belong to the same non-uniform memory access NUMA domain. In this way, for nodes in the same NUMA domain, the first node decides the transfer of computing authority among multiple cache nodes in the first cache layer, which can ensure the cache consistency of the multiple cache nodes.

In a possible design, obtaining the operation authority by the first node includes: the first node sends a second read request to the second node, and the second read request is used to request the operation authority of the first address; the second node is used to manage Consistency of multiple first nodes, the second node and the multiple first nodes managed by the second node belong to the same die; the first node receives the first read response sent by the second node, and the first read response includes computing authority. In this way, for the nodes in the same die, the second node decides the transfer of computing rights between multiple first nodes, which can ensure the cache consistency of the multiple first nodes.

In a possible design, the second node obtains the second data from the first node when it is determined that the time when the first node obtains the computing authority reaches the second time period, and the second data is the latest update of the first address obtained by the first node. As a result of the calculation, the second node sends the second data and the calculation authority to the third node, and the third node is a node in the same cache layer as the first node and belongs to the same die. It can be understood that the second node is a consistent node between the first node and the third node. The first node and the third node do not need to perform a lock-grabbing operation on the first address, and the first node controls the transfer of the operation authority.

In a possible design, the first cache layer is a secondary L2 cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs; the first node is a local agent of the cache, and is used for reading and writing operations on the memory. For example, the first node and the third node are the L2_0 node and the L2_1 node of the L2 cache layer, and the second node is the home agent. When the home agent has the operation authority to operate on the first address, it can control the operation authority to be transferred between the L2_0 node and the L2_1 node to ensure the consistency of the L2 cache layer.

In the second aspect, the embodiment of the present application provides a communication device, the communication device includes a first node and multiple cache nodes in the first cache layer, and the first node is used to: receive multiple first reads sent by multiple cache nodes Request, multiple first read requests are used to request the computing authority of the first address; the first node is used to manage the consistency of multiple cache nodes; according to the order of the first first read requests sent by multiple cache nodes Determine the order in which multiple cache nodes obtain computing permissions; when computing permissions are obtained, control the transfer of computing permissions between multiple cache nodes according to the order in which multiple cache nodes obtain computing permissions. The beneficial effects achieved in the second aspect can refer to the beneficial effects in the first aspect.

In a possible design, the first node is specifically used to: when obtaining the computing permission, if the first cache node is the first node to obtain the computing permission among multiple cache nodes, send the computing permission to the first cache node ; Obtain the first data from the first cache node, the first data is the operation result of the first address by the first cache node, and send the first data and operation authority to the second cache node, the second cache node is a plurality of cache nodes The second node that obtains the computing authority in .

In a possible design, the first node is specifically configured to: obtain the first data from the first cache node when it is determined that the time when the first cache node acquires the computing authority reaches the first time period; wherein, the first time period is the first time period A node is determined according to the level of the first node.

In a possible design, the first cache layer is a first-level L1 cache layer, and multiple cache nodes correspond to the dedicated caches of multiple CPUs; the first node is a node of the second-level L2 cache layer, and the first The node is a shared cache of multiple CPUs corresponding to multiple cache nodes; the multiple cache nodes and the first node belong to the same non-uniform memory access NUMA domain.

In a possible design, the first node is specifically used to: send a second read request to the second node, and the second read request is used to request the operation authority of the first address; the second node is used to manage multiple first nodes consistency, the second node and multiple first nodes managed by the second node belong to the same die; receive the first read response sent by the second node, and the first read response includes the computing authority.

In a possible design, the second node obtains the second data from the first node when it is determined that the time when the first node obtains the computing authority reaches the second time period, and the second data is the latest update of the first address obtained by the first node. As a result of the calculation, the second node sends the second data and the calculation authority to the third node, and the third node is a node in the same cache layer as the first node and belongs to the same die.

In a possible design, the first cache layer is a secondary L2 cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs; the first node is a local agent of the cache, and is used for reading and writing operations on the memory.

In the third aspect, a computer-readable storage medium includes computer instructions, and when the computer instructions are run on the communication device, the communication device executes the method described in the first aspect and any possible design of the first aspect .

In a fourth aspect, a computer program product, when the computer program product is run on a computer, enables a communication device to execute the method described in the first aspect and any possible design of the first aspect.

For the beneficial effects corresponding to the above other aspects, refer to the description of the beneficial effects of the method, which will not be repeated here.

Description of drawings

FIG. 1 is a schematic diagram of a method for accessing data in the prior art;

FIG. 2 is a schematic diagram of a system architecture provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a hardware structure of a communication device provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application;

FIG. 6A is a schematic structural diagram of a communication device provided by an embodiment of the present application;

FIG. 6B is a schematic structural diagram of a communication device provided by an embodiment of the present application.

Detailed ways

In order to facilitate understanding, descriptions of some concepts related to the embodiments of the present application are provided by way of example for reference. As follows:

Atomic instructions: Used to protect the critical area in the synchronization mechanism and complete the read, modify and write operations of shared variables.

Atomic operation: Refers to one or a series of operations that cannot be interrupted. In a single-core CPU, operations that can be performed in one instruction can be regarded as atomic operations. Atomic operations are not interleaved, run to completion once started, and do not switch to another thread.

Lock: Multiple threads access shared memory. In order to ensure mutually exclusive access to shared memory, it is necessary to lock the memory. Only threads that own the shared memory lock can access the shared memory. It can be understood that when multiple CPUs send instructions to access the same address, they need to perform a lock grab operation, and the CPU that grabs the lock can perform operations on the address.

Critical section: Shared memory cannot be accessed by multiple threads at the same time. When a thread enters the critical section, other threads or processes must wait.

MESI consistency protocol: Each cache line in MESI has four states, which are modify (Modify, M) state, exclusive (Exclusive, E) state, shared (Share, S) state and invalid (Invalid, I) state .

The M state means that the data in the cache line (that is, the variable in this application) has been modified, and the data is inconsistent with the data in the main memory, and the data in the current cache line shall prevail. The data in that cache line needs to be written back to main memory at some point in the future (before other CPUs are allowed to read the corresponding data in main memory). After being written back to main memory, the state of the cache line will change to E state.

The E state means that the data in the cache line is consistent with the data in the main memory, and the data only exists in the cache of the CPU, that is, the processor core corresponding to the cache at this level exclusively occupies the data and has not been modified ( clean). This state can change to S state at any time when other CPUs read the cache line, and change to M state when a CPU modifies the data in the cache line.

The S state means that the data in the cache line is consistent with the data in the main memory, and the data exists in multiple cache lines, that is, multiple processor cores share the data. When a CPU modifies the data in the cache line, the cache line in other CPUs is invalidated and becomes invalid I state.

The I state means that the data in the cache line is unavailable invalid data (the cache line may be modified by other CPUs).

Non-Unified Memory Access (NUMA): A part of the storage area manages a part of addresses. In the embodiment of this application, a cluster (cluster) can be understood as a NUMA domain.

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. Among them, in the description of the embodiments of this application, unless otherwise specified, "/" means or means, for example, A/B can mean A or B; "and/or" in this article is only a description of associated objects The association relationship of indicates that there may be three kinds of relationships, for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, and B exists independently. In addition, in the description of the embodiments of the present application, "plurality" refers to two or more than two.

Hereinafter, the terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of this embodiment, unless otherwise specified, "plurality" means two or more.

The time delay for the CPU to complete atomic instructions at different cache levels is different. The atomic instructions that are cached and completed in the CPU can be called near atomic instructions (near atomic), and the atomic instructions that are cached and completed on the SOC side are called far atomic instructions ( far atomic), according to the definition of multi-level cache, CPU internal cache can include, for example, first-level cache (Level 1 Cache, L1 Cache), and SoC side cache can include, for example, second-level cache (Level 2 Cache, L2 Cache) Or include L2 Cache and third-level cache (Level 3 Cache, L3 Cache); or when the CPU internal cache includes L1 Cache and L2 Cache, for example, the SoC side cache can include L3 Cache, for example. It is understandable that the delay for the CPU to complete far atomic will be longer than the delay for completing near atomic. At present, in order to avoid the situation where multi-core CPUs compete for the E state, as shown in Figure 1, the atomic instructions sent by all multi-core CPUs can be aggregated to a public interleaving node through multi-level caches, and the public interleaving node sends local The agent (home agent) applies to obtain the E state of the variable to be rewritten by each CPU, and then executes the atomic instructions sent by the multi-core CPU in sequence in the common interleaving node according to the scheduling order. That is, the atomic instructions of the multi-core CPU are all completed by far atomic in the public interleaving node, so there is a large delay. Moreover, when multiple CPUs simultaneously access atomic instructions of the same memory address in the system, that is, when multiple CPUs need the same E state, a large number of atomic instructions will converge on the common interleaving node, resulting in entry queue congestion. Since the number of atomic instructions that can be accommodated in the cache (buffer) of the public interleaving node is limited, when the number of atomic instructions contained in the buffer of the public interleaving node reaches the upper limit, the rest of the atomic instructions cannot be stored in the buffer, so the CPU needs to re- Resending the atomic instruction will cause channel backpressure on the public interleaving node, which will reduce the throughput of the atomic instruction and slow down the operation speed, thereby affecting system performance.

Atomic instructions are divided into conditional (conditional) atomic instructions and unconditional (non-conditional) atomic instructions. The conditional atomic instruction needs to be judged before the atomic operation is performed on the memory of the public interleaving node. The atomic operation is performed only when the judgment is valid. The atomic operation is only performed when the compare value is the same as the memory value obtained from the memory (the data read according to the address of the atomic instruction has not been modified by other CPUs). Non-conditional atomic instructions perform atomic operations directly on the memory of public interleaving nodes, without first performing conditional judgments before performing atomic operations, such as Atomic Add, Atomic Swap and other instructions.

In a scenario where a multi-core CPU executes conditional atomic instructions on a common interleaving node, such as a mutual exclusive lock scenario, only one CPU can grab the lock, and the rest of the queued atomic instructions essentially fail to grab the lock. In this scenario, these queued atomic instructions will not only cause entry queue congestion, but also fail to grab locks. When these conditional atomic instructions perform atomic compare, most of the atomic instructions will fail to complete the atomic operation. These failed atomic instructions will not affect the operation of the system if they do not perform atomic operations. Therefore, it is unnecessary to queue up most of the failed atomic instructions, which will lead to low overall system throughput.

It is understandable that the problem of low overall system throughput is particularly serious when multi-core CPUs perform atomic operations across NUMA domains. The problem of ownership migration will occur due to the maintenance of cache consistency. In this scenario, the problem caused by ownership migration The system overhead is very high, that is, lock grabbing events across NUMA domains often occur. In particular, the smaller the workload of the critical section, the shorter the time for each CPU to grab the lock. After performing atomic operations in a short period of time, the lock will be grabbed by other CPUs. Therefore, the system overhead caused by this frequent ownership migration is bigger. In addition, the system overhead is particularly serious when cross-die (bare chip) cross-chip access is involved, and the final result is that when multi-core CPUs perform atomic operations at the same time, the throughput is low. The current existing software technology is to achieve lock rotation across NUMA domains by grabbing locks in the same NUMA domain, which is equivalent to grabbing locks between multiple CPUs in the same NUMA domain, and then transferring the lock to the next one after a period of time. In a NUMA domain, multiple CPUs in the next NUMA domain can grab locks, so as to achieve the effect of lock rotation across NUMA domains. However, there are still multiple CPUs grabbing locks, resulting in frequent ownership migration and a large system overhead. Throughput is lower.

Therefore, this application proposes a method for accessing data, which can be applied to a communication device. The communication device in this application can be understood as a chip, such as all general-purpose chips such as consumer chips and industrial chips. Considering the problem of low throughput caused by lock grabbing when multi-core CPUs perform atomic operations in the prior art, this application is in a multi-level cache system architecture where multi-core CPUs perform atomic operations on the same address at the same time, resulting in atomic operation conflicts , E-state scheduling transfer management is carried out from bottom to top through the consistency management node. After the CPU acquires the E-state, it performs near atomic operations at the cache level inside the CPU core to reduce the delay of the CPU completing the atomic operation, which can avoid atomic instructions in public Interleaved node queuing causes entry queue congestion, reduces the conflict rate and system overhead of atomic operations caused by multi-core CPU lock grabbing, and improves the throughput of atomic operations.

As shown in FIG. 2 , the embodiment of the present application can be applied to a multi-level cache system architecture. Cache is a temporary storage located between CPU and memory. Generally, cache can be divided into L1 Cache, L2 Cache, and some CPUs also have L3 Cache. When the CPU wants to read a piece of data, it first looks it up from the L1 Cache, if it does not find the data, then it looks it up from the L2 Cache, if it continues to find no data, it can look it up from the L3 Cache or memory.

Figure 2 shows a typical multi-level cache system with two cache levels, L1 Cache and L2 Cache. L1_0, L1_1, L1_2, and L1_3 represent the private L1 Cache (private L1 Cache) of CPU0, CPU1, CPU2, and CPU3 respectively. L2_0 represents the shared L2 Cache (shared L2 Cache) of CPUs in cluster0. Multiple caches in each cluster The consistency of L1 Cache is managed by the next level of L2 Cache in the same cluster, that is, L1_0, L1_1, L1_2, L1_3, and L2_0 are in the same cluster0, and L2_0 manages the consistency of L1_0, L1_1, L1_2, and L1_3. The L2 Cache in each cluster is managed by the next-level home agent in the same die. For example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6, L1_7, L2_0, L2_1 and home agent are in the same die. And the home agent manages the consistency of L2_0 and L2_1. Therefore, for the L1 Cache in the cluster, its coherent management node is the L2 Cache in the same cluster, and for the L2 Cache, its coherent management node is the home agent in the same die.

The embodiment of the present application can be applied to a communication device, as shown in FIG. 3 , which shows a schematic diagram of the hardware structure of a communication device. The communication device can include the chip in the embodiment of the present application. In FIG. 3 , the chip 300 is exemplified. chip. The chip 300 may include a processor 301, a memory controller (memory controller) 302, a multi-level cache 303, and the like.

It can be understood that, the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the chip 300 . In other embodiments of the present application, the chip 300 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components. The illustrated components can be realized in hardware, software or a combination of software and hardware.

Wherein, the processor 301 may include one or more processing units. For example: the processor 301 may include a graphics processing unit (graphics processing unit, GPU), a central processing unit (central processing unit, CPU), and/or a neural network processor (neural network processing unit, NPU), etc. Wherein, different processing units may be independent components, or may be integrated in one or more processors. In some embodiments, the chip 300 may also include one or more processors 301, and multiple processors may be understood as multi-core CPUs.

Processor 301 may include a portion of multi-level cache 303 for storing instructions and data. Part of the multi-level cache 303 here can be understood as a CPU internal cache. In some embodiments, the internal cache of the CPU may be a cache memory, such as the above-mentioned L1 cache. The L1 cache can save the instructions or data recently used or recycled by the processor 301. If the CPU needs to use the instructions or data again, it can be directly called from the L1 cache, reducing the waiting time of the CPU and improving the efficiency of the system. In this embodiment of the application, the CPU internal cache can also be understood as L1 Cache and L2 Cache. That is, L1 Cache and L2 Cache are the internal cache levels of the CPU core, which can be used for near atomic operations on the CPU. Among them, L1 Cache is a private cache level inside the CPU, and L2 Cache is a shared cache level inside the CPU.

The processor 301 can be understood as the nerve center and command center of the chip 300 . The operation control signal can be generated according to the instruction opcode and timing signal, and the control of fetching and executing instructions can be completed.

The memory controller 302 is used to manage data read and write operations in the memory, and the memory controller 302 may also include a home agent (home agent), which may be used to implement read and write operations to the memory. In the embodiment of the present application, the home agent can be used to be responsible for the cache coherence management of the L2 cache of the chip 300. It can be understood that the home agent is located outside the CPU core and can be used for far atomic operations on the CPU. In the embodiment of the present application, the home agent can also provide the CPU with the E state when accessing the memory address.

The rest of the multi-level cache 303 can be understood as the CPU external cache, that is, the cache level on the SoC, such as L3 cache.

Applying the chip provided by the above-mentioned application, in the method of accessing data proposed by the application for the chip in conjunction with the accompanying drawings, in the scenario where the communication device, such as the multi-core CPU of the chip, performs atomic operations on the same memory address at the same time, the consistency management The process of node E-state scheduling management from bottom to top is introduced.

As shown in Figure 4, the embodiment of the present application provides a method for accessing data, taking two Cache levels (L1 Cache and L2 Cache) and one home agent level in the multi-level Cache system architecture as an example, wherein, L1 Cache And L2 Cache is the internal cache level of the CPU core, the method includes:

Step 401, the first node receives multiple first read requests sent by multiple cache nodes of the first cache layer.

In some embodiments, a plurality of first read requests are all used to request the operation authority (E state) of the first address, and any node can only obtain the operation authority of the first address to read the data in the first address Perform an overwrite operation. It can be understood that the operation authority required to rewrite data at the same memory address is the same, and the operation authority states required to rewrite data at different memory addresses are different.

The fact that the first address in this application is accessed by multiple CPUs can be understood as that the first address is read, modified and written by multiple CPUs.

In this embodiment of the present application, the first node is used to manage the consistency of multiple cache nodes in the first cache layer, that is, the first node can control the transfer of computing authority between multiple cache nodes in the first cache layer, within a period of time Only one CPU can obtain the operation authority of the first address, that is, only one CPU is allowed to rewrite the data in the first address on multiple cache nodes of the first cache layer within a period of time, thereby ensuring that multiple cache nodes cache coherency among them.

Exemplarily, the first node may be a home agent or an L2 Cache.

When the first node is a home agent, the first cache layer can be L2 Cache, and multiple cache nodes can correspond to L2_0 and L2_1 in Figure 2 respectively. In this scenario, step 401 can be understood as the home agent receiving multiple first read requests sent by multiple L2 Cache. In this case, when multiple L2 caches determine that the data of the first address is stored locally, the first read request sent by the L2 cache to the home agent is used to request the computing authority for the first address, or when the L2 cache determines that the data of the first address is not locally stored. When storing the data at the first address, the first read request sent by the L2 cache to the home agent is used to request the data at the first address and the computing authority to the first address. Before the L2 cache sends the first read request to the home agent, the L2 cache may also receive read requests sent by multiple L1 caches, which are used to request the data and computing permissions of the first address from the L2 cache, and then the L2 cache sends the home agent The agent sends a first read request to request the home agent for computing authority to the first address. It should be understood that L2_0 or L2_1 may send one or more read requests to the home agent, all of which are used to request the operation permission of the same memory address.

When the first node is L2 Cache, the first cache layer is L1 Cache, taking the first node as L2_0 in Figure 2 as an example, multiple cache nodes can include L1_0, L1_1, L1_2 and L1_3 in Figure 2. In this scenario, step 401 can be understood as the L2 Cache (that is, L2_0) receiving multiple first read requests sent by multiple L1 Cache (that is, L1_0, L1_1, L1_2, and L1_3). In this case, multiple L1 caches determine that the data of the first address is not stored locally (the initial state of the L1 cache is invalid), and the first read request sent by the L1 cache to the L2 cache is used to request the data of the first address and the The operation authority of the first address. Before the L1 cache sends the first read request to the L2 cache, the L1 cache will also receive the read request sent by the CPU to request the data and computing authority of the first address from the L1 cache, and then the L1 cache will send the first read request to the L2 cache. A read request to request the data and operation permission of the first address from the L2 cache. It should be understood that L1_0, L1_1, L1_2, and L1_3 may issue one or more first read requests to the L2 Cache, all of which are used to request data and operation permissions of the same memory address.

In some embodiments, the first read request may also be used to request data and computing permissions of the first address. Exemplarily, when the first node is the home agent, if the data of the first address is not stored in the L1 cache and the L2 cache, the first read request received by the home agent from the L2 cache is used to simultaneously request the data of the first address and operational permissions. If the data of the first address is stored in the L2 cache, the L2 cache does not need to request the data of the first address from the home agent, but only requests the operation authority of the first address.

The difference from when the first node is the home agent is that when the first node is the L2 Cache, the first cache layer is the L1 Cache, and the initial state of the L1 Cache is the I state, which can be understood as the data in the L1 Cache is invalid ,unavailable. Therefore, the cache nodes in the L1 Cache can issue the first read request to the L2 cache, and the first read request is used to request the data and computing authority of the first address. After the L2 cache receives the first read request sent by the L1 Cache, it judges whether the read request sent to the home agent is a request for the data of the first address and a read request for computing authority or only according to whether it stores the data of the first address. The operation authority of the first address is sufficient.

In step 402, the first node determines the order in which the multiple cache nodes obtain computing permissions according to the order of the first first read requests sent by the multiple cache nodes.

Wherein, each cache node can send one or more first read requests to the first node, and it is understandable to determine the order in which multiple cache nodes obtain computing permissions according to the order of the first first read requests sent by multiple cache nodes respectively For, multiple cache nodes are queued according to the time when they send the first first read request.

Exemplarily, when the first node is an L2 Cache, taking the first node as L2_0 in FIG. 2 as an example, multiple cache nodes may respectively correspond to L1_0, L1_1, L1_2 and L1_3 in FIG. 2 . Suppose L1_0 first sends read request 1 to L2_0, then L1_2 sends read request 2 to L2_0, then L1_0 sends read request 3 to L2_0, and finally L1_1 sends read request 4 to L2_0, then L1_0 sends two read requests , L1_0 queues according to the time of sending the first read request, that is, L1_0 queues according to the time of sending read request 1. Then L2_0 determines that the order in which these three cache nodes obtain computing permissions is L1_0-L1_2-L1_1. Since L1_3 did not send a read request, L1_3 is not in the order.

Similarly, when the first node is the home agent, when the home agent obtains the computing authority, determine the multiple cache nodes of the L2 Cache according to the order of the first first read request sent by the multiple cache nodes of the L2 Cache The order of the computing permissions.

Step 403 , when the first node obtains the computing permission, control the transfer of computing permission among multiple cache nodes according to the order in which the multiple cache nodes obtain the computing permission.

Exemplarily, when the first node is the L2 Cache, when the L2 Cache obtains the operation authority, it will control the operation authority in the order in which multiple cache nodes of the L1 Cache obtain the operation authority determined in step 402. transfer between cache nodes. Similarly, when the first node is the home agent, when the home agent obtains the operation authority, it will control the operation authority in the multiple cache nodes of the L2 Cache in the order in which the operation authority is determined in step 402. Transfer between cache nodes.

In some embodiments, when the first node (such as L2_0 in FIG. 2 ) obtains the computing permission, it sends the computing permission to the first cache node (such as L1_0 in FIG. 2 ), and after the first cache node obtains the computing permission, The CPU corresponding to the first cache node uses the computing authority to perform computation in the first cache node.

Wherein, the first cache node can be understood as the first cache node arranged in the order of the first first read requests sent by the multiple cache nodes, which is equivalent to the first node that obtains the computing authority among the multiple cache nodes.

Exemplarily, when the first node obtains the computing permission, it sends the computing permission to the first cache node, and the first cache node will process one or more first read requests in the order of the first read requests sent to the first node For processing, processing the first read request can be understood as rewriting the data in the first address, that is, completing the atomic operation. After the rewriting operation is completed on the data in the first address, new data in the first address will be generated, that is, the operation result of the first cache node on the first address.

In some embodiments, the first node obtains the first data from the first cache node, and sends the first data and the computing permission to the second cache node (for example, L1_1 in FIG. 2 ).

Wherein, the first data is an operation result of the first address by the first cache node, and the operation result can be understood as the latest operation result obtained after the first cache node processes one or more first read requests. The second cache node is the second node that obtains computing authority among the multiple cache nodes.

Exemplarily, the first node obtains from the first cache node the latest operation result after the first cache node has processed one or more first read requests, that is, the first data, and sends them to the The second cache node in the order of the first first read request sends the first data and computing permission, that is, sends the first data and computing permission to the second cache node. After the second cache node obtains the latest operation result and operation authority of the first address, it will also process one or more first read requests in the order of the first read requests sent to the first node. The latest calculation result of the first address is rewritten to update the latest calculation result of the first address.

In some embodiments, the first node acquires the first data from the first cache node when determining that the time when the first cache node acquires the computing authority reaches a first time period.

Exemplarily, when the time when the first cache node obtains the computing permission reaches the first time period, that is, when the time when the first cache node obtains the computing permission reaches the time limit, the first node will obtain the first cache node from the first cache node. A cache node rewrites the data at the first address to the latest operation result after the operation is completed, that is, the first data, and sends the first data and the operation authority to the second cache node, so as to realize the operation authority transfer.

The purpose of setting the first time period is to allow the operation permission to be transferred to any cache node to stay for a longer period of time, so that the cache node can process multiple first read requests after obtaining the operation permission, instead of just processing After a first read request, the computing authority is transferred out, thereby avoiding the frequent migration of computing authority at the cache node. At the same time, setting the first time period is also to enable each cache node that sends the first read request to obtain computing authority and ensure fairness among cache nodes. limit.

Because the range of consistency managed by different levels is different, that is, the number of managed CPUs is different. For example, L2_0 and L2_1 shown in Figure 2 manage the consistency of 4 CPUs respectively, and the home agent manages the consistency of L2_0 and L2_1, including 8 CPUs, so the delay (delay) of the computing authority at the home agent needs to be greater than the delay of the computing authority at L2_0 or L2_1, that is, the residence time of the computing authority at the home agent needs to be longer than the stay of the computing authority at L2_0 or L2_1 Time can ensure that the L2 Cache can perform normal operations after obtaining the computing authority. It is equivalent to achieving on-demand delay. If the delay is unified according to the maximum number of CPUs, the dynamic performance of the system will be lost.

Therefore, the first time period may be determined by the first node according to the level at which the first node is located.

It can also be understood that the first node determines the time to keep the computing authority in each cache node according to the number of CPUs it contains. It can also be understood that the first node jointly and adaptively determines the time to keep the computing authority in each cache node according to its own level and the number of cache nodes that send the first read request upstream.

Exemplarily, when the first node is a home agent, taking Figure 2 as an example, the home agent can determine the time that the control operation authority stays in L2_0 and L2_1 according to its own level, that is, including 8 CPUs, so as to ensure The computing authority can fairly process the read requests issued by the 8 CPUs during the residence time of the home agent. The home agent can also include 8 CPUs according to its own level, and the number of cache nodes that send the first read request upstream, that is, the number of CPUs that send read requests in the L1 Cache, for example, as shown in Figure 2 8 L1s, that is, 8 CPUs, jointly adaptively determine the time to stay in L2_0 and L2_1 for the computing authority, so as to ensure that the computing authority can fairly process the read request in the L1 Cache within the residence time of the home agent Read requests issued by multiple CPUs.

In the following, the first node is a shared cache of multiple CPUs corresponding to multiple cache nodes, and the multiple cache nodes respectively correspond to dedicated caches of multiple CPUs as an example for illustration. It can be understood that the first node is a node of the second-level L2 cache layer (L2Cache), and the first cache layer is a first-level L1 cache layer (L1 Cache).

Wherein, the multiple cache nodes and the first node belong to the same non-uniform memory access NUMA domain. It can also be understood that multiple cache nodes and the first node belong to the same cluster.

Exemplarily, taking FIG. 2 as an example, when the first node is L2_0, and the multiple cache nodes are L1_0, L1_1, L1_2, and L1_3 respectively, then L1_0, L1_1, L1_2, L1_3, and L2_0 belong to the same cluster0. When the first node is L2_1, and the multiple cache nodes are L1_4, L1_5, L1_6, and L1_7, then L1_4, L1_5, L1_6, L1_7, and L2_1 belong to the same cluster1.

In some embodiments, when the first node (such as L2_0) receives the first read request from multiple cache nodes of the first cache layer, the first node sends the second read request to the second node (such as home agent), The first node receives the first read response sent by the second node, where the first read response includes a computing authority.

Wherein, the second node is used to manage the consistency of the multiple first nodes, and the second node and the multiple first nodes managed by the second node belong to the same die. When the second node can be a home agent, it can be understood that the home agent is used to manage the consistency of multiple L2 caches. Taking Figure 2 as an example, home agent, L2_0, L2_1, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6 and L1_7 belong to the same die.

The second read request is used to request the computing authority of the first address. The second read request can be understood as a read request sent by the L2 Cache to the home agent. When there is data at the first address in the L2 Cache, the second read request is used to request The operation authority of the first address, when there is no data of the first address in the L2 Cache, the second read request can also be used to request the data and operation authority of the first address.

Exemplarily, taking FIG. 2 as an example, L2_0 sends a second read request to the home agent, which is used to request the computing authority of the first address, and the first address is the address requested when the L1 Cache sends the read request to the L2 Cache. It can be understood that the L1 Cache sends the first read request to the L2 Cache to request the operation permission of the first address, and after receiving the first read request, the L2Cache sends the first read request to the home agent to request the operation permission of the first address. Second reading request. After the home agent receives the second read request sent by the L2 Cache, it sends the first read response to the L2 Cache, that is, sends the computing authority of the first address.

In some embodiments, when the second node determines that the time when the first node obtains the computing authority reaches the second time period, it obtains the second data from the first node, and the second node sends the second data and Operational authority.

Wherein, the second data is the latest operation result of the first address obtained by the first node. Since the L2 Cache will control the transfer of computing authority to multiple L1 Cache, each L1 Cache will update the data in the first address when using the computing authority to process one or more first read requests, so the second data can be understood as L2 Cache control After the operation authority is transferred among multiple L1 Cache, the latest operation result of the first address is obtained from the last L1 Cache where the operation authority is transferred.

The third node is a node in the same cache layer as the first node and belongs to the same die. Taking FIG. 2 as an example, when the first node is L2_0, the third node may be L2_1.

Exemplarily, when the home agent determines that the time when L2_0 acquires the operation authority reaches the second time period, it obtains the latest operation result of the first address from L2_0, and sends the latest operation result of the first address and the operation authority to L2_1.

In some embodiments, the first read request sent by multiple cache nodes of the first cache layer may be directly operated on the first node.

Exemplarily, the home agent sends the computing authority to the L2 Cache, and the first read request sent by the L1 Cache to the L2 Cache can be directly computed at the L2 Cache.

No matter whether the operation on the data of the first address is performed at the L1 Cache or the L2 Cache, it is a near atomic operation, and its operation delay is much smaller than that of the far atomic operation.

As shown in FIG. 5 , it is a flow chart of a method for accessing data provided by the embodiment of the present application. When the L1 Cache sends the first read request to the L2 Cache (in the example of rd in Figure 5), which is used to request the data and computing authority of the first address (in the example of E in Figure 5), the L2 Cache sends them separately according to multiple L1 Cache The order of the first first read request determines the order in which multiple L1 Caches obtain computing permissions. If there is no data and operation authority of the first address in the L2 Cache, the L2 Cache sends a second read request to the home agent (Rd is used as an example in Figure 5) to request the data and operation authority of the first address, and the home agent follows The order of the first and second read requests sent by multiple L2 Caches determines the order in which multiple L2 Caches obtain computing permissions.

After the home agent obtains the data and operation authority of the first address, it sends the data and operation authority of the first address to the first L2 Cache (assumed to be L2_0) that obtains the operation authority. After L2_0 receives the data and operation authority of the first address, it sends the data and operation authority of the first address to the first L1 Cache (assumed to be L1_0) that obtains the operation authority, and determines that the time when L1_0 obtains the operation authority reaches the first During the period of time, the latest operation result of the first address is obtained from L1_0. Afterwards, L2_0 sends the latest calculation result and the calculation permission of the first address to the second L1 Cache (assumed to be L1_1) that obtains the calculation permission, so that L1_1 can rewrite the latest calculation result of the first address. By analogy, when the last L1 Cache (assumed to be L1_3) that obtains the operation authority completes the rewriting operation on the latest operation result of the first address, because L2_0 no longer controls the operation authority to transfer, so L1_3 rewrites the operation result of the first address The latest operation result and operation authority are fed back (the feedback ACK is used as an example in Figure 5) to L2_0. At this time, the home agent determines that the time when L2_0 obtains the operation authority reaches the second time period, and the home agent obtains the latest operation result of the first address from L2_0, and sends the first address to the second L2 Cache (assumed to be L2_1) that obtains the operation authority The latest operation result and the operation authority of the address enable L2_1 to rewrite the latest operation result of the first address. By analogy, when the last L2 Cache (assumed to be L2_1) that obtains the operation authority completes the rewriting operation on the latest operation result of the first address, since the home agent no longer controls the operation authority to transfer, L2_1 writes it to the first address The latest operation results and operation permissions are fed back to the home agent. The home agent stores the latest operation result of the first address it finally receives into the shared memory, and completes a process of operation permission scheduling.

Therefore, a method for accessing data provided by the embodiment of the present application implements E-state scheduling transfer management from bottom to top through the consistency management node, sets different processing times for different levels according to the requirements of different levels, and controls the E-state Stay in the NUMA domain for a longer time to complete more atomic instructions, and after the CPU obtains the E state, it performs near atomic operations at the cache level inside the CPU core to reduce the delay in completing atomic operations by the CPU and avoid atomic instructions being interleaved in public Node queuing causes congestion in the entry queue, reduces the conflict rate and system overhead of atomic operations, and improves the throughput of atomic operations.

FIG. 6A shows a schematic diagram of a chip structure. The chip includes a multi-core CPU, a dedicated cache node for the multi-core CPU, a shared cache node and a second node. The multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6, and CPU7 in FIG. 6A; the dedicated cache nodes of the multi-core CPU include, for example, cache node 0, cache node 1, cache node 2, and cache node in FIG. 6A. 3. Cache node 4, cache node 5, cache node 6, and cache node 7. The first node is a consistency management node of cache node 0, cache node 1, cache node 2, and cache node 3, and forms cluster 0. The third node is a consistency management node of cache node 4 , cache node 5 , cache node 6 , and cache node 7 , and forms cluster 1 . The second node and the third node belong to the same die, and the second node is a consistent node of the second node and the third node.

In the chip shown in Figure 6A, the first node can be understood as L2 Cache, multiple cache nodes can be understood as L1 Cache, at this time, the second node can be understood as home agent, and the third node is located in the same cache as the first node Another level of L2 Cache, each cache node corresponds to a CPU, the cache node is the dedicated cache of the CPU. The first node in FIG. 6A can be used to perform the above step 401, step 402 and step 403. Cache node 0-cache node 7 are in the L1 cache layer, the first node and the third node are in the L2 cache layer, and the second node is Relevant method steps when home agent, etc., and/or other processes for the techniques described herein.

FIG. 6B shows a schematic diagram of another chip structure. The chip includes a multi-core CPU, a dedicated cache node for the multi-core CPU, a shared cache node and a second node. The multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6, and CPU7 in FIG. 6B; the dedicated cache nodes of the multi-core CPU include, for example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6 and L1_7. Cache node 0 is a coherent node of L1_0, L1_1, L1_2, and L1_3, and forms cluster 0; cache node 1 is a coherent node of L1_4, L1_5, L1_6, and L1_7, and forms cluster 1. The first node is a coherent node of cache node 0 and cache node 1 .

In the chip shown in FIG. 6B , the first node can be understood as a home agent, multiple cache nodes can be understood as L2 Cache, and each cache node is also used to control the dedicated cache L1 Cache of multiple CPUs. The first node in FIG. 6B can be used to perform the method steps when the first node in the

above steps

401, 402, and 403 is a home agent, and cache nodes 0 and 1 are L2 Cache, etc., and/or used in the methods described herein other processes of the technology. The embodiment of the present application also provides a computer-readable storage medium, in which computer program code is stored, and when the processor executes the computer program code, the communication device executes the method for accessing data in the above-mentioned embodiments.

Embodiments of the present application also provide a computer program product, which, when running on a computer, causes the computer to execute the above-mentioned related steps, so as to realize the method for accessing data performed by the communication device in the above-mentioned embodiments.

Through the description of the above embodiments, those skilled in the art can understand that for the convenience and brevity of the description, only the division of the above functional modules is used as an example for illustration. In practical applications, the above functions can be assigned by different Completion of functional modules means that the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be Incorporation or may be integrated into another device, or some features may be omitted, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The unit described as a separate component may or may not be physically separated, and the component displayed as a unit may be one physical unit or multiple physical units, that is, it may be located in one place, or may be distributed to multiple different places . Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software function unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solution of the embodiment of the present application is essentially or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, and the software product is stored in a storage medium Among them, several instructions are included to make a device (which may be a single-chip microcomputer, a chip, etc.) or a processor (processor) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: various media that can store program codes such as U disk, mobile hard disk, read only memory (ROM), random access memory (random access memory, RAM), magnetic disk or optical disk.

The above content is only the specific implementation of the application, but the scope of protection of the application is not limited thereto. Anyone familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the application, and should covered within the scope of protection of this application. Therefore, the protection scope of the present application should be determined by the protection scope of the claims.

Claims

A method for accessing data, characterized in that the method comprises:

The first node receives multiple first read requests sent by multiple cache nodes in the first cache layer, and the multiple first read requests are all used to request the computing authority of the first address; the first node is used to manage all Describe the consistency of multiple cache nodes;

The first node determines the order in which the plurality of cache nodes obtain the computing authority according to the order of the first first read requests sent by the plurality of cache nodes respectively;

When the first node acquires the computing permission, it controls the transfer of the computing permission among the multiple cache nodes according to the sequence in which the multiple cache nodes acquire the computing permission.
The method according to claim 1, wherein when the first node obtains the operation authority, it controls the operation according to the order of the first first read requests sent by the plurality of cache nodes respectively. The transfer of authority among the plurality of cache nodes includes:

When the first node obtains the computing permission, if the first cache node is the first node to obtain the computing permission among the plurality of cache nodes, then the first node sends a request to the first cache node Send the computing authority;

The first node acquires first data from the first cache node, the first data is an operation result of the first address by the first cache node, and sends the first cache node to the second cache node data and the computing permission, the second cache node is the second node that obtains the computing permission among the plurality of cache nodes.
The method according to claim 2, wherein obtaining the first data from the first cache node by the first node comprises:

The first node acquires the first data from the first cache node when it is determined that the time when the first cache node acquires the computing authority reaches a first time period;

Wherein, the first time period is determined by the first node according to the level where the first node is located.
The method according to any one of claims 1-3, wherein the first cache layer is a first-level L1 cache layer, and the multiple cache nodes correspond to the dedicated caches of multiple central processing units (CPUs);

The first node is a node of the secondary L2 cache layer, and the first node is a shared cache of multiple CPUs corresponding to the multiple cache nodes;

The multiple cache nodes and the first node belong to the same non-uniform memory access NUMA domain.
The method according to claim 4, wherein obtaining the computing authority by the first node comprises:

The first node sends a second read request to the second node, and the second read request is used to request the operation authority of the first address; the second node is used to manage the consistency of multiple first nodes property, the second node and the plurality of first nodes managed by the second node belong to the same die;

The first node receives a first read response sent by the second node, where the first read response includes the computing authority.
The method according to claim 5, wherein the method further comprises:

When the second node determines that the time when the first node obtains the computing authority reaches a second time period, it obtains second data from the first node, and the second data is all obtained by the first node. The latest operation result of the first address, the second node sends the second data and the operation authority to a third node, and the third node is in the same cache layer as the first node and belongs to the same die of nodes.
The method according to any one of claims 1-3, wherein the first cache layer is a secondary L2 cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs;

The first node is a local agent of the cache, and is used for performing read and write operations on the memory.
A communication device, characterized in that the communication device includes a first node and a plurality of cache nodes in a first cache layer, and the first node is used for:

receiving a plurality of first read requests sent by the plurality of cache nodes, the plurality of first read requests are all used to request the operation authority of the first address; the first node is used to manage the plurality of cache nodes consistency;

determining the order in which the plurality of cache nodes acquire the computing authority according to the order of the first first read requests sent by the plurality of cache nodes respectively;

When the computing permission is acquired, the computing permission is controlled to be transferred among the multiple cache nodes according to the sequence in which the multiple cache nodes acquire the computing permission.
The communication device according to claim 8, wherein the first node is specifically used for:

When the computing permission is obtained, if the first cache node is the first node to obtain the computing permission among the plurality of cache nodes, send the computing permission to the first cache node;

Acquiring first data from the first cache node, the first data being an operation result of the first address by the first cache node, and sending the first data and the operation to a second cache node authority, the second cache node is the second node that obtains the computing authority among the plurality of cache nodes.
The communication device according to claim 9, wherein the first node is specifically used for:

Obtaining the first data from the first cache node when it is determined that the time when the first cache node acquires the computing authority reaches a first time period;

Wherein, the first time period is determined by the first node according to the level where the first node is located.
The communication device according to any one of claims 8-10, wherein the first cache layer is a first-level L1 cache layer, and the multiple cache nodes correspond to dedicated caches of multiple central processing units (CPUs) respectively. ;

The first node is a node of the secondary L2 cache layer, and the first node is a shared cache of multiple CPUs corresponding to the multiple cache nodes;

The multiple cache nodes and the first node belong to the same non-uniform memory access NUMA domain.
The communication device according to claim 11, wherein the first node is specifically used for:

sending a second read request to a second node, where the second read request is used to request the operation authority of the first address; the second node is used to manage the consistency of multiple first nodes, and the second The second node and the plurality of first nodes managed by the second node belong to the same die;

Receive a first read response sent by the second node, where the first read response includes the computing authority.
The communication device according to claim 12, wherein:

When the second node determines that the time when the first node obtains the computing authority reaches a second time period, it obtains second data from the first node, and the second data is all obtained by the first node. The latest operation result of the first address, the second node sends the second data and the operation authority to a third node, and the third node is in the same cache layer as the first node and belongs to the same die of nodes.
The communication device according to any one of claims 8-10, wherein the first cache layer is a secondary L2 cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs;

The first node is a local agent of the cache, and is used for performing read and write operations on the memory.
A computer-readable storage medium is characterized by comprising computer instructions, and when the computer instructions are run on the communication device, the communication device is made to execute the method described in any one of claims 1-7.