WO2022246769A1 - Data access method and apparatus - Google Patents

Data access method and apparatus

Info

Publication number
WO2022246769A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/096550
Other languages
French (fr)
Chinese (zh)
Inventor
黎卓南
苏勇
韩立虎
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/096550 (published as WO2022246769A1)
Priority to CN202180086851.0A (published as CN116685958A)
Publication of WO2022246769A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems

Definitions

  • The embodiments of the present application relate to the field of chip technology, and in particular to a data access method and apparatus.
  • Modern computer systems and multi-core chips support shared memory in hardware; that is, the shared memory can be accessed by multiple central processing units (CPUs) and serves as a medium for sharing and transferring data between software processes, which improves inter-process communication efficiency.
  • To ensure that software ultimately obtains correct results after multiple CPUs perform read-modify-write operations on the same shared memory address, various memory consistency models have been proposed: multiple CPUs must follow prescribed ordering rules when reading and rewriting shared memory in order to obtain correct execution results; otherwise, the correctness of the execution results is not guaranteed.
  • Because different consistency models define different read and write rules, a CPU may execute instructions without dependencies out of order for higher performance, and multiple threads are allowed to interleave to improve throughput; to guarantee the execution order of read-modify-write operations when threads interleave, a synchronization mechanism is used.
  • In the synchronization mechanism, atomic access to a single shared variable is achieved through atomic operations, while atomic access over a series of instructions is achieved with locks and critical sections. Whether through atomic operations or through locks and critical sections, at the hardware level the read-modify-write of the shared variable is ultimately carried out by atomic instructions.
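  • For illustration only (not part of the patent text), a minimal C++ sketch of the two synchronization styles just described: a single shared variable updated by an atomic operation, and a multi-instruction update protected by a lock and critical section:

```cpp
#include <atomic>
#include <mutex>

std::atomic<long> counter{0};   // a single shared variable
long balance = 0;               // shared state guarded by a lock
std::mutex balance_lock;

void bump_counter() {
    // One shared variable: a single atomic read-modify-write instruction
    // (e.g., lock xadd on x86) performs the whole update.
    counter.fetch_add(1, std::memory_order_relaxed);
}

void deposit(long amount) {
    // A series of instructions: the lock marks a critical section, and the
    // lock itself is acquired with an atomic instruction underneath.
    std::lock_guard<std::mutex> guard(balance_lock);
    balance += amount;
}
```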
  • At present, computer systems and multi-core chips add cache memory (cache) to the memory hierarchy to reduce the latency of accessing memory; usually several levels of cache are introduced, known as a cache hierarchy or multi-level cache.
  • To guarantee the correctness of the data read when different CPUs access the same cache line address at the same moment, the system must support cache coherence when multiple CPUs simultaneously read-modify-write the same shared variable.
  • Cache coherence is usually implemented based on the Modified-Exclusive-Shared-Invalid (MESI) coherence protocol.
  • In the MESI coherence protocol, before rewriting a shared variable a CPU must first obtain the exclusive (E) state of that variable, that is, ownership of the variable with permission to rewrite it, which is the operation permission for the shared memory address where the variable resides. A CPU therefore also needs to obtain the E state of a shared variable when it completes a read-modify-write of that variable through an atomic instruction.
  • However, when multiple CPUs read and rewrite the same shared variable at the same time, contention causes the E state to migrate frequently between CPUs, which is known as ownership migration.
  • Ownership migration causes a large system overhead, resulting in poor throughput when a multi-core CPU executes atomic instructions.
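  • As an illustrative sketch (thread and iteration counts are arbitrary assumptions, not from the patent), the following C++ program reproduces the contention pattern described above: every fetch_add must first win exclusive ownership of the cache line holding the shared variable, so the line ping-pongs between cores:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> shared_var{0};

// Each fetch_add must first pull the cache line holding shared_var into
// the exclusive (E) state on its own core, so ownership migrates on
// nearly every iteration when several threads run at once.
void worker(int iters) {
    for (int i = 0; i < iters; ++i)
        shared_var.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)          // 8 contending CPUs
        threads.emplace_back(worker, 1000000);
    for (auto& th : threads) th.join();  // throughput is limited by
    return 0;                            // ownership-migration overhead
}
```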
  • Embodiments of the present application provide a data access method and apparatus, which can improve the throughput of atomic operations performed by multi-core CPUs in a multi-level cache system architecture.
  • In a first aspect, an embodiment of the present application provides a data access method. The method includes: a first node receives multiple first read requests sent by multiple cache nodes in a first cache layer, where each of the first read requests is used to request the operation permission of a first address, and the first node is used to manage the coherence of the multiple cache nodes; the first node determines, according to the order of the earliest first read request sent by each of the cache nodes, the order in which the multiple cache nodes obtain the operation permission; and when the first node obtains the operation permission, it controls the transfer of the operation permission among the multiple cache nodes according to that order.
  • In this way, the operation permission for rewriting the data at an address is scheduled and transferred hierarchically from the bottom up by the first node (since the first node receives the first read requests, it can be regarded as the next level below the first cache layer), so that the operation permission moves among the cache nodes and near atomic operations can be performed at the cache levels inside the CPU core, reducing the latency for the CPU to complete atomic operations. This also avoids the entry-queue congestion caused in the prior art by atomic instructions queuing at a common interleaving node, reduces the conflict rate and system overhead of atomic operations caused by multi-core CPU lock contention, and improves the throughput of atomic operations.
  • In a possible design, controlling the transfer of the operation permission among the multiple cache nodes includes: when the first node obtains the operation permission, if a first cache node is the first of the multiple cache nodes to obtain the permission, the first node sends the operation permission to the first cache node; the first node then obtains first data from the first cache node, where the first data is the result of the first cache node's operation on the first address, and sends the first data and the operation permission to a second cache node, where the second cache node is the second of the multiple cache nodes to obtain the operation permission.
  • While the first cache node holds the operation permission of the first address, it can process the first read requests it needs to process without interference from the other cache nodes, and the other cache nodes do not need to perform any lock-contention operation.
  • After the CPU corresponding to the first cache node performs its operation on the first address at the first cache node and obtains a result, the operation result and the operation permission of the first address can be sent to the second cache node, so that the CPU corresponding to the second cache node may then operate on the first address. In this way, the data obtained by each cache node is guaranteed to be the latest result of operations on the first address, which ensures cache coherence.
  • In a possible design, the first node obtaining the first data from the first cache node includes: when the first node determines that the first cache node has held the operation permission for a first time period, it obtains the first data from the first cache node, where the first time period is determined by the first node according to the level at which the first node is located.
  • Setting the first time period ensures that each cache node, after obtaining the operation permission, can use it to process multiple first read requests, avoiding the frequent migration caused by contention for the permission while also keeping operations on the first address fair across the multiple cache nodes.
  • In a possible design, the first cache layer is the level-1 (L1) cache layer and the multiple cache nodes correspond to the dedicated caches of multiple CPUs; the first node is a node of the level-2 (L2) cache layer and is a shared cache of the multiple CPUs corresponding to the multiple cache nodes; the multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain.
  • Having the first node decide the transfer of the operation permission among the multiple cache nodes of the first cache layer ensures cache coherence among those cache nodes.
  • In a possible design, the first node obtaining the operation permission includes: the first node sends a second read request to a second node, where the second read request is used to request the operation permission of the first address; the second node is used to manage the coherence of multiple first nodes, and the second node and the multiple first nodes it manages belong to the same die; the first node receives a first read response sent by the second node, where the first read response includes the operation permission.
  • Having the second node decide the transfer of the operation permission among the multiple first nodes ensures cache coherence among those first nodes.
  • In a possible design, when the second node determines that the first node has held the operation permission for a second time period, it obtains second data from the first node, where the second data is the latest operation result of the first address obtained by the first node.
  • The second node sends the second data and the operation permission to a third node, where the third node is a node in the same cache layer as the first node and belongs to the same die. It can be understood that the second node is the coherence node of the first node and the third node; the first node and the third node do not need to contend for a lock on the first address, since the second node controls the transfer of the operation permission.
  • In a possible design, the first cache layer is the level-2 (L2) cache layer and the multiple cache nodes are shared caches, each shared by multiple CPUs; the first node is the home agent of the cache and is used to perform read and write operations on the memory.
  • For example, when the first node and the third node are the L2_0 and L2_1 nodes of the L2 cache layer, the second node is the home agent.
  • When the home agent holds the operation permission for operating on the first address, it can control the transfer of that permission between the L2_0 node and the L2_1 node, ensuring the coherence of the L2 cache layer.
  • In a second aspect, an embodiment of the present application provides a communication apparatus. The communication apparatus includes a first node and multiple cache nodes in a first cache layer. The first node is configured to: receive multiple first read requests sent by the multiple cache nodes, where each of the first read requests is used to request the operation permission of a first address, and the first node is used to manage the coherence of the multiple cache nodes; determine, according to the order of the earliest first read request sent by each of the cache nodes, the order in which the multiple cache nodes obtain the operation permission; and, when the operation permission is obtained, control the transfer of the operation permission among the multiple cache nodes according to that order.
  • For the beneficial effects achieved by the second aspect, refer to the beneficial effects of the first aspect.
  • In a possible design, the first node is specifically configured to: when the operation permission is obtained, if a first cache node is the first of the multiple cache nodes to obtain the permission, send the operation permission to the first cache node; obtain first data from the first cache node, where the first data is the result of the first cache node's operation on the first address; and send the first data and the operation permission to a second cache node, where the second cache node is the second of the multiple cache nodes to obtain the operation permission.
  • In a possible design, the first node is specifically configured to obtain the first data from the first cache node when it determines that the first cache node has held the operation permission for a first time period, where the first time period is determined by the first node according to the level at which the first node is located.
  • In a possible design, the first cache layer is the level-1 (L1) cache layer and the multiple cache nodes correspond to the dedicated caches of multiple CPUs; the first node is a node of the level-2 (L2) cache layer and is a shared cache of the multiple CPUs corresponding to the multiple cache nodes; the multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain.
  • In a possible design, the first node is specifically configured to: send a second read request to a second node, where the second read request is used to request the operation permission of the first address; the second node is used to manage the coherence of multiple first nodes, and the second node and the multiple first nodes it manages belong to the same die; and receive a first read response sent by the second node, where the first read response includes the operation permission.
  • When the second node determines that the first node has held the operation permission for a second time period, it obtains second data from the first node, where the second data is the latest operation result of the first address obtained by the first node.
  • The second node sends the second data and the operation permission to a third node, where the third node is a node in the same cache layer as the first node and belongs to the same die.
  • In a possible design, the first cache layer is the level-2 (L2) cache layer, the multiple cache nodes are shared caches, each shared by multiple CPUs, and the first node is the home agent of the cache, used to perform read and write operations on the memory.
  • A computer-readable storage medium is also provided. The storage medium includes computer instructions that, when run on the communication apparatus, cause the communication apparatus to perform the method described in the first aspect and any possible design of the first aspect.
  • A computer program product is also provided which, when run on a computer, enables a communication apparatus to perform the method described in the first aspect and any possible design of the first aspect.
  • FIG. 1 is a schematic diagram of a data access method in the prior art.
  • FIG. 2 is a schematic diagram of a system architecture provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a hardware structure of a communication device provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application.
  • FIG. 6A is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • FIG. 6B is a schematic structural diagram of a communication device provided by an embodiment of the present application.
  • Atomic instruction: used in the synchronization mechanism to protect critical sections and to complete read-modify-write operations on shared variables.
  • Atomic operation: one operation or a series of operations that cannot be interrupted. On a single-core CPU, an operation that completes within one instruction can be regarded as atomic. Atomic operations do not interleave with each other; once started, they run to completion without switching to another thread.
  • Critical section: shared memory that cannot be accessed by multiple threads at the same time. When one thread enters the critical section, other threads or processes must wait.
  • MESI: each cache line in MESI is in one of four states: the modified (M) state, the exclusive (E) state, the shared (S) state, and the invalid (I) state.
  • The M state means the data in the cache line (that is, the variable in this application) has been modified and is inconsistent with the data in main memory; the data in the current cache line takes precedence. The data in that cache line must be written back to main memory at some future point (before other CPUs are allowed to read the corresponding data from main memory); after the write-back, the state of the cache line changes to the E state.
  • The E state means the data in the cache line is consistent with the data in main memory and exists only in this CPU's cache; that is, the processor core corresponding to this cache holds the data exclusively and the data has not been modified (clean). This state changes to the S state when another CPU reads the cache line, and to the M state when a CPU modifies the data in the cache line.
  • The S state means the data in the cache line is consistent with the data in main memory and exists in multiple caches; that is, multiple processor cores share the data. When one CPU modifies the data, the copies of the cache line in the other CPUs are invalidated and enter the I state.
  • The I state means the data in the cache line is invalid and unusable (the cache line may have been modified by another CPU). These states and transitions are summarized in the sketch below.
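  • As an illustrative aid (not part of the patent text), the four states and the transitions just described can be summarized in a small C++ sketch:

```cpp
// Minimal sketch of the MESI states and the transitions described above;
// it follows the text of this application, not any specific hardware.
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// A clean exclusive line becomes shared when another CPU reads it.
MesiState on_remote_read(MesiState s) {
    return (s == MesiState::Exclusive) ? MesiState::Shared : s;
}

// Modifying a line makes it M on the writer; copies held by other CPUs
// are invalidated (they go to I).
MesiState on_local_write(MesiState) {
    return MesiState::Modified;
}

// Per the text above, a modified line returns to E after write-back.
MesiState on_writeback(MesiState s) {
    return (s == MesiState::Modified) ? MesiState::Exclusive : s;
}
```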
  • Non-uniform memory access (NUMA): each part of the storage area manages a part of the addresses. In this application, a cluster can be understood as a NUMA domain.
  • The terms "first" and "second" are used for description only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of these embodiments, unless otherwise specified, "plurality" means two or more.
  • The CPU-internal cache may include, for example, the level-1 cache (L1 Cache), in which case the SoC-side cache may include, for example, the level-2 cache (L2 Cache), or the L2 Cache together with the level-3 cache (L3 Cache); alternatively, when the CPU-internal cache includes the L1 Cache and the L2 Cache, the SoC-side cache may include, for example, the L3 Cache.
  • The latency for a CPU to complete a far atomic operation is longer than the latency for completing a near atomic operation.
  • In the prior art, the atomic instructions issued by all the cores of a multi-core CPU are aggregated through the multi-level caches to a common interleaving node. The common interleaving node applies to the local home agent to obtain the E state of the variable each CPU wants to rewrite, and then executes the atomic instructions issued by the multi-core CPU in scheduling order at the common interleaving node.
  • The atomic instructions of the multi-core CPU are thus all completed as far atomic operations at the common interleaving node, which incurs a large latency.
  • When multiple CPUs in the system simultaneously issue atomic instructions accessing the same memory address, that is, when multiple CPUs need the same E state, a large number of atomic instructions converge on the common interleaving node, causing entry-queue congestion.
  • Atomic instructions are divided into conditional atomic instructions and non-conditional atomic instructions.
  • A conditional atomic instruction performs a check before the atomic operation is executed on the memory at the common interleaving node, and the atomic operation is performed only when the check succeeds. For example, for an atomic compare instruction, the atomic operation is performed only when the compare value equals the value fetched from memory (that is, the data read at the instruction's address has not been modified by another CPU).
  • A non-conditional atomic instruction performs the atomic operation directly on the memory at the common interleaving node without a prior check, for example Atomic Add and Atomic Swap instructions.
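  • For illustration only, the two classes map naturally onto the standard C++ atomics: compare_exchange for a conditional instruction, fetch_add and exchange for non-conditional ones:

```cpp
#include <atomic>

std::atomic<int> value{0};

// Conditional: the update happens only if `value` still equals `expected`,
// i.e., no other CPU has modified it since it was read.
bool conditional_update(int expected, int desired) {
    return value.compare_exchange_strong(expected, desired);
}

// Non-conditional: the read-modify-write always takes place.
int unconditional_add(int delta)    { return value.fetch_add(delta); }   // Atomic Add
int unconditional_swap(int new_val) { return value.exchange(new_val); }  // Atomic Swap
```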
  • When a multi-core CPU executes conditional atomic instructions at the common interleaving node, for example in a mutual-exclusion lock scenario, only one CPU can win the lock, and the rest of the queued atomic instructions essentially fail to acquire it. These queued instructions not only cause entry-queue congestion but also fail the lock acquisition: when the conditional atomic instructions perform their atomic compare, most of them fail to complete the atomic operation. A failed atomic instruction performs no atomic operation and does not affect system behavior, so queuing this majority of failing instructions is unnecessary and leads to low overall system throughput.
  • An existing software technique achieves lock rotation across NUMA domains by confining lock contention to one NUMA domain at a time: the CPUs within one NUMA domain contend for the lock, and after a period of time the lock is transferred to the next NUMA domain, whose CPUs then contend for it, producing the effect of lock rotation across NUMA domains. However, multiple CPUs still contend for the lock within each domain, so ownership still migrates frequently, the system overhead is large, and throughput remains low.
  • In view of this, this application proposes a data access method, which can be applied to a communication apparatus.
  • The communication apparatus in this application can be understood as a chip, for example general-purpose chips such as consumer chips and industrial chips.
  • Specifically, in a multi-level cache system architecture where the cores of a multi-core CPU perform atomic operations on the same address at the same time and atomic-operation conflicts arise, this application carries out E-state scheduling and transfer management from the bottom up through the coherence management nodes. After a CPU acquires the E state, it performs near atomic operations at the cache levels inside the CPU core, reducing the latency for the CPU to complete atomic operations; this avoids the entry-queue congestion caused by atomic instructions queuing at the common interleaving node, reduces the conflict rate and system overhead of atomic operations caused by multi-core lock contention, and improves the throughput of atomic operations.
  • Cache is temporary storage located between the CPU and memory.
  • The cache can be divided into the L1 Cache and the L2 Cache, and some CPUs also have an L3 Cache.
  • When the CPU wants to read a piece of data, it first looks in the L1 Cache; if the data is not found, it looks in the L2 Cache; if the data is still not found, it looks in the L3 Cache or in memory.
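  • A minimal sketch of this lookup order (a toy model for illustration only, not a hardware description):

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

// Toy model: each cache level is a map from address to data; a real cache
// indexes by set and way at cache-line granularity.
std::array<std::unordered_map<uint64_t, uint64_t>, 3> cache_level; // L1..L3
std::unordered_map<uint64_t, uint64_t> memory;

uint64_t cpu_read(uint64_t addr) {
    for (auto& level : cache_level) {    // L1 first, then L2, then L3
        auto it = level.find(addr);
        if (it != level.end())
            return it->second;           // hit at this level
    }
    return memory[addr];                 // miss everywhere: read memory
}
```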
  • FIG. 2 shows a typical multi-level cache system with two cache levels, L1 Cache and L2 Cache.
  • L1_0, L1_1, L1_2, and L1_3 represent the private L1 Caches of CPU0, CPU1, CPU2, and CPU3, respectively.
  • L2_0 represents the shared L2 Cache of the CPUs in cluster0. The coherence of the multiple L1 Caches in each cluster is managed by the next-level L2 Cache in the same cluster; that is, L1_0, L1_1, L1_2, L1_3, and L2_0 are in the same cluster0, and L2_0 manages the coherence of L1_0, L1_1, L1_2, and L1_3.
  • The L2 Caches in each cluster are managed by the next-level home agent in the same die.
  • For example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6, L1_7, L2_0, L2_1, and the home agent are in the same die.
  • The home agent manages the coherence of L2_0 and L2_1. Therefore, for an L1 Cache in a cluster, its coherence management node is the L2 Cache in the same cluster; for an L2 Cache, its coherence management node is the home agent in the same die.
  • The embodiment of the present application can be applied to a communication apparatus. FIG. 3 shows a schematic diagram of the hardware structure of a communication apparatus, which can include the chip of the embodiment of the present application; the chip 300 is taken as an example.
  • The chip 300 may include a processor 301, a memory controller 302, a multi-level cache 303, and the like.
  • The structure illustrated in this embodiment of the present application does not constitute a specific limitation on the chip 300.
  • In other embodiments, the chip 300 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently.
  • The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • The processor 301 may include one or more processing units.
  • For example, the processor 301 may include a graphics processing unit (GPU), a central processing unit (CPU), and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent components or may be integrated in one or more processors.
  • The chip 300 may also include one or more processors 301; multiple processors can be understood as a multi-core CPU.
  • The processor 301 may include a portion of the multi-level cache 303 for storing instructions and data.
  • That portion of the multi-level cache 303 can be understood as the CPU-internal cache.
  • The CPU-internal cache may be a cache memory, such as the above-mentioned L1 cache.
  • The L1 cache can hold instructions or data that the processor 301 has recently used or reused; if the CPU needs the same instructions or data again, it can fetch them directly from the L1 cache, reducing CPU waiting time and improving system efficiency.
  • The CPU-internal cache can also be understood as the L1 Cache together with the L2 Cache.
  • In that case, the L1 Cache and L2 Cache are cache levels internal to the CPU core and can be used for near atomic operations by the CPU; the L1 Cache is a private cache level inside the CPU, and the L2 Cache is a shared cache level inside the CPU.
  • The processor 301 can be understood as the nerve center and command center of the chip 300.
  • It can generate operation control signals according to the instruction opcode and timing signals, and complete the control of fetching and executing instructions.
  • The memory controller 302 is used to manage read and write operations on the memory; the memory controller 302 may also include a home agent, which can be used to implement read and write operations on the memory.
  • The home agent can be responsible for the cache coherence management of the L2 cache of the chip 300. The home agent is located outside the CPU core and can be used for far atomic operations by the CPU. In this embodiment of the present application, the home agent can also provide the CPU with the E state when the CPU accesses a memory address.
  • The remaining part of the multi-level cache 303 can be understood as the CPU-external cache, that is, the cache levels on the SoC side, such as the L3 cache.
  • The embodiment of the present application provides a data access method, taking a multi-level cache system architecture with two cache levels (L1 Cache and L2 Cache) and one home agent level as an example, where the L1 Cache and the L2 Cache are cache levels internal to the CPU core. The method includes:
  • Step 401: the first node receives multiple first read requests sent by multiple cache nodes of the first cache layer.
  • Each of the first read requests is used to request the operation permission (E state) of the first address; a node can rewrite the data at the first address only after obtaining that operation permission. It can be understood that the operation permission required to rewrite data at the same memory address is the same, while rewriting data at different memory addresses requires different operation permission states.
  • In this application, the first address being accessed by multiple CPUs can be understood as the first address being read, modified, and written by multiple CPUs.
  • The first node is used to manage the coherence of the multiple cache nodes in the first cache layer; that is, the first node can control the transfer of the operation permission among the multiple cache nodes of the first cache layer so that, within any given period of time, only one CPU holds the operation permission of the first address. In other words, within a period of time only one CPU is allowed to rewrite the data at the first address on the multiple cache nodes of the first cache layer, which ensures cache coherence among those cache nodes.
  • In this embodiment of the application, the first node may be a home agent or an L2 Cache.
  • When the first node is the home agent, the first cache layer can be the L2 Cache, and the multiple cache nodes can correspond to L2_0 and L2_1 in FIG. 2, respectively.
  • In that case, step 401 can be understood as the home agent receiving multiple first read requests sent by multiple L2 Caches.
  • When the L2 cache locally stores the data of the first address, the first read request it sends to the home agent is used to request only the operation permission of the first address; when the L2 cache determines that the data of the first address is not stored locally, the first read request it sends to the home agent is used to request both the data at the first address and the operation permission of the first address.
  • Before that, the L2 cache may also receive read requests sent by multiple L1 caches, which are used to request the data and operation permission of the first address from the L2 cache; the L2 cache then sends the first read request to the home agent to request the operation permission of the first address.
  • For example, L2_0 or L2_1 may send one or more read requests to the home agent, all of which request the operation permission of the same memory address.
  • When the first node is the L2 Cache, step 401 can be understood as the L2 Cache (that is, L2_0) receiving multiple first read requests sent by multiple L1 Caches (that is, L1_0, L1_1, L1_2, and L1_3).
  • When the multiple L1 caches determine that the data of the first address is not stored locally (the initial state of an L1 cache is invalid), the first read request sent by an L1 cache to the L2 cache requests both the data of the first address and the operation permission of the first address.
  • Before an L1 cache sends the first read request to the L2 cache, the L1 cache also receives a read request sent by its CPU to request the data and operation permission of the first address from the L1 cache; the L1 cache then sends the first read request to the L2 cache to request the data and operation permission of the first address.
  • For example, L1_0, L1_1, L1_2, and L1_3 may each issue one or more first read requests to the L2 Cache, all requesting the data and operation permission of the same memory address.
  • In a possible implementation, the first read request may also be used to request both the data and the operation permission of the first address.
  • When the first node is the home agent, the first read request the home agent receives from the L2 cache may request the data of the first address and the operation permission at the same time. If the data of the first address is already stored in the L2 cache, the L2 cache does not need to request the data from the home agent and requests only the operation permission of the first address.
  • The difference when the first node is the L2 Cache is that the first cache layer is then the L1 Cache, whose initial state is the I state, meaning the data in the L1 Cache is invalid and unavailable. Therefore, the cache nodes of the L1 Cache issue first read requests to the L2 cache that request both the data and the operation permission of the first address.
  • After the L2 cache receives the first read request sent by the L1 Cache, it decides, according to whether it stores the data of the first address, whether the read request it sends to the home agent should request both the data of the first address and the operation permission, or only the operation permission of the first address.
  • Step 402: the first node determines, according to the order of the earliest first read request sent by each of the multiple cache nodes, the order in which the multiple cache nodes obtain the operation permission.
  • Each cache node can send one or more first read requests to the first node. Determining the order in which the multiple cache nodes obtain the operation permission according to the order of the earliest first read request sent by each node can be understood as queuing the cache nodes by the time at which each of them sent its first read request.
  • For example, suppose L1_0 first sends read request 1 to L2_0, then L1_2 sends read request 2 to L2_0, then L1_0 sends read request 3 to L2_0, and finally L1_1 sends read request 4 to L2_0. L1_0 sent two read requests and is queued according to the time it sent its earliest one, that is, read request 1. L2_0 therefore determines that the three cache nodes obtain the operation permission in the order L1_0, L1_2, L1_1; since L1_3 sent no read request, it is not in the order.
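  • For illustration only (a sketch of the example above, with invented helper names), the ordering rule amounts to enqueuing a cache node the first time one of its read requests arrives:

```cpp
#include <deque>
#include <iostream>
#include <string>
#include <unordered_set>

std::deque<std::string> grant_order;   // order of obtaining the permission
std::unordered_set<std::string> seen;

// A node is queued once, at the time of its earliest read request;
// its later requests do not re-queue it.
void on_read_request(const std::string& node) {
    if (seen.insert(node).second)
        grant_order.push_back(node);
}

int main() {
    for (const char* n : {"L1_0", "L1_2", "L1_0", "L1_1"}) // requests 1..4
        on_read_request(n);
    for (const auto& n : grant_order)
        std::cout << n << ' ';          // prints: L1_0 L1_2 L1_1
}
```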
  • Similarly, when the first node is the home agent, the home agent, upon obtaining the operation permission, determines the order in which the multiple cache nodes of the L2 Cache obtain the operation permission according to the order of the earliest first read request sent by each of them.
  • Step 403: when the first node obtains the operation permission, it controls the transfer of the operation permission among the multiple cache nodes according to the order in which they obtain it.
  • When the first node is the L2 Cache and the L2 Cache obtains the operation permission, it controls the transfer of the operation permission among the multiple cache nodes of the L1 Cache in the order determined in step 402.
  • When the first node is the home agent and the home agent obtains the operation permission, it controls the transfer of the operation permission among the multiple cache nodes of the L2 Cache in the order determined in step 402.
  • For example, when the first node (such as L2_0 in FIG. 2) obtains the operation permission, it sends the operation permission to the first cache node (such as L1_0 in FIG. 2); after the first cache node obtains the operation permission, the CPU corresponding to the first cache node uses it to perform operations in the first cache node.
  • The first cache node can be understood as the node ranked first in the order of earliest first read requests sent by the multiple cache nodes, that is, the first of the multiple cache nodes to obtain the operation permission.
  • Processing a first read request can be understood as rewriting the data at the first address, that is, completing the atomic operation. After the rewrite of the data at the first address completes, new data exists at the first address, namely the result of the first cache node's operation on the first address.
  • Then the first node obtains the first data from the first cache node and sends the first data and the operation permission to the second cache node (for example, L1_1 in FIG. 2).
  • The first data is the result of the first cache node's operation on the first address, which can be understood as the latest result after the first cache node has processed one or more first read requests.
  • The second cache node is the second of the multiple cache nodes to obtain the operation permission.
  • That is, the first node obtains from the first cache node the latest operation result after the first cache node has processed its one or more first read requests, namely the first data, and sends the first data and the operation permission to the cache node ranked second in the order of earliest first read requests, that is, to the second cache node.
  • After the second cache node obtains the latest operation result and the operation permission of the first address, it likewise processes one or more first read requests in the order in which it sent them to the first node, rewriting the latest operation result of the first address so as to update it.
  • In a possible implementation, the first node obtains the first data from the first cache node when it determines that the first cache node has held the operation permission for the first time period.
  • That is, when the time for which the first cache node has held the operation permission reaches the first time period, i.e., its tenure expires, the first node obtains from the first cache node the latest operation result produced by the first cache node's rewrites of the data at the first address, namely the first data, and sends the first data and the operation permission to the second cache node, realizing the transfer of the operation permission.
  • One purpose of setting the first time period is to let the operation permission dwell for a while at whichever cache node it is transferred to, so that the cache node can process multiple first read requests after obtaining the permission rather than handing it on after processing only one, thereby avoiding frequent migration of the operation permission at the cache nodes.
  • Setting the first time period also ensures that each cache node that sent a first read request can obtain the operation permission, guaranteeing fairness among the cache nodes; the dwell time is therefore bounded.
  • For example, L2_0 and L2_1 shown in FIG. 2 each manage the coherence of 4 CPUs, while the home agent manages the coherence of L2_0 and L2_1 and thus covers 8 CPUs. The dwell time (delay) of the operation permission at the home agent therefore needs to be longer than the dwell time of the operation permission at L2_0 or L2_1; only then can an L2 Cache carry out its operations normally after obtaining the permission. This amounts to an on-demand delay: if the delay were set uniformly according to the maximum number of CPUs, the system would lose dynamic performance.
  • In a possible implementation, the first time period may be determined by the first node according to the level at which the first node is located.
  • It can be understood that the first node determines how long the operation permission stays at each cache node according to the number of CPUs it covers. It can also be understood that the first node adaptively determines the dwell time jointly from its own level and the number of upstream cache nodes that sent first read requests.
  • For example, the home agent can determine how long it lets the operation permission stay at L2_0 and L2_1 according to its own level, that is, the 8 CPUs it covers, ensuring that during its dwell time the operation permission can fairly serve the read requests issued by the 8 CPUs.
  • The home agent can also combine its own level, covering 8 CPUs, with the number of upstream cache nodes that sent first read requests, that is, the number of CPUs issuing read requests in the L1 Cache (for example, the 8 L1 caches, i.e., 8 CPUs, shown in FIG. 2), to adaptively determine how long the operation permission stays at L2_0 and L2_1, ensuring that during the dwell time the operation permission can fairly serve the read requests issued by those CPUs.
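  • A minimal sketch of this idea (the formula and names are invented assumptions; the patent only states that the time period depends on the node's level):

```cpp
#include <algorithm>

// Hypothetical dwell-time rule: let the permission stay at a child node
// long enough for every requesting CPU below that child to be served once,
// so a higher-level node covering more CPUs (e.g., the home agent with 8)
// uses a longer dwell time than a lower-level node (e.g., L2_0 with 4).
int dwell_cycles(int cpus_under_child, int requesting_cpus,
                 int per_request_cycles) {
    int active = std::min(cpus_under_child, requesting_cpus);
    return active * per_request_cycles;
}
```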
  • The following takes as an example the case where the first node is a shared cache of the multiple CPUs corresponding to the multiple cache nodes, and the multiple cache nodes are the private caches of those CPUs. That is, the first node is a node of the level-2 L2 cache layer (L2 Cache), and the first cache layer is the level-1 L1 cache layer (L1 Cache).
  • The multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain; it can also be understood that the multiple cache nodes and the first node belong to the same cluster.
  • In a possible implementation, when the first node (such as L2_0) receives the first read requests from the multiple cache nodes of the first cache layer, the first node sends a second read request to the second node (such as the home agent), and the first node receives the first read response sent by the second node, where the first read response includes the operation permission.
  • The second node is used to manage the coherence of multiple first nodes, and the second node and the multiple first nodes managed by the second node belong to the same die.
  • The second node can be a home agent, which is used to manage the coherence of multiple L2 caches. Taking FIG. 2 as an example, the home agent, L2_0, L2_1, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6, and L1_7 belong to the same die.
  • The second read request is used to request the operation permission of the first address.
  • The second read request can be understood as a read request sent by the L2 Cache to the home agent. When the data of the first address is present in the L2 Cache, the second read request is used to request only the operation permission of the first address; when the data of the first address is not present in the L2 Cache, the second read request can also be used to request the data of the first address together with the operation permission.
  • For example, L2_0 sends a second read request to the home agent to request the operation permission of the first address, where the first address is the address requested when the L1 Cache sent its read request to the L2 Cache.
  • That is, the L1 Cache sends the first read request to the L2 Cache to request the operation permission of the first address, and the L2 Cache sends the second read request to the home agent to request the operation permission of the first address. After the home agent receives the second read request sent by the L2 Cache, it sends the first read response to the L2 Cache, that is, it sends the operation permission of the first address.
  • In a possible implementation, when the second node determines that the first node has held the operation permission for the second time period, it obtains the second data from the first node and sends the second data and the operation permission to the third node.
  • The second data is the latest operation result of the first address obtained by the first node. Since the L2 Cache controls the transfer of the operation permission among multiple L1 Caches, and each L1 Cache updates the data at the first address while using the permission to process its one or more first read requests, the second data can be understood as the latest operation result of the first address obtained from the last L1 Cache to hold the permission after the L2 Cache has rotated the permission through the multiple L1 Caches.
  • The third node is a node in the same cache layer as the first node and belongs to the same die. Taking FIG. 2 as an example, when the first node is L2_0, the third node may be L2_1.
  • For example, when the home agent determines that L2_0 has held the operation permission for the second time period, it obtains the latest operation result of the first address from L2_0 and sends that result together with the operation permission to L2_1.
  • In a possible implementation, the first read requests sent by the multiple cache nodes of the first cache layer may also be executed directly at the first node.
  • For example, after the home agent sends the operation permission to the L2 Cache, the first read requests sent by the L1 Cache to the L2 Cache can be executed directly at the L2 Cache.
  • FIG. 5 is a flowchart of a data access method provided by the embodiment of the present application.
  • The L1 Cache sends a first read request (denoted rd in FIG. 5) to the L2 Cache, requesting the data and operation permission (denoted E in FIG. 5) of the first address; the L2 Cache determines the order in which the multiple L1 Caches obtain the operation permission according to the order of the earliest first read request sent by each of them.
  • The L2 Cache sends a second read request (denoted Rd in FIG. 5) to the home agent, requesting the data and operation permission of the first address; the home agent determines the order in which the multiple L2 Caches obtain the operation permission according to the order of the earliest second read request sent by each of them.
  • After the home agent obtains the data and operation permission of the first address, it sends the data and operation permission of the first address to the first L2 Cache to obtain the permission (assumed to be L2_0).
  • After L2_0 receives the data and operation permission of the first address, it sends them to the first L1 Cache to obtain the permission (assumed to be L1_0); when it determines that L1_0 has held the operation permission for the first time period, it obtains the latest operation result of the first address from L1_0.
  • L2_0 then sends the latest operation result and the operation permission of the first address to the second L1 Cache to obtain the permission (assumed to be L1_1), so that L1_1 can rewrite the latest operation result of the first address.
  • When the last L1 Cache to obtain the operation permission (assumed to be L1_3) completes its rewrite of the latest operation result of the first address, L2_0 no longer controls any further transfer of the permission, so L1_3 feeds back the latest operation result it wrote for the first address together with the operation permission (the feedback is denoted ACK in FIG. 5) to L2_0.
  • When the home agent determines that L2_0 has held the operation permission for the second time period, the home agent obtains the latest operation result of the first address from L2_0 and sends the latest operation result and the operation permission of the first address to the second L2 Cache to obtain the permission (assumed to be L2_1), so that L2_1 can rewrite the latest operation result of the first address.
  • When the last L2 Cache to obtain the operation permission (assumed to be L2_1) completes its rewrite of the latest operation result of the first address, the home agent no longer controls any further transfer of the permission, so L2_1 feeds back the latest operation result it wrote for the first address together with the operation permission to the home agent.
  • Finally, the home agent stores the last operation result of the first address it receives into the shared memory, completing one round of operation-permission scheduling.
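  • Purely as an illustrative sketch of this flow (the structure and names are invented; this is not the patent's implementation): each node rotates the data and the E state through its children in earliest-request order and feeds the latest result back upward:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Requires C++17 (std::vector of an incomplete type).
struct Node {
    std::vector<Node> children;              // empty for an L1 leaf

    uint64_t rewrite(uint64_t data) { return data + 1; }  // stand-in atomic op

    // Grant the permission to each child in order, reclaim it when the
    // child's dwell time is up, and return the latest result upward
    // (the ACK in FIG. 5).
    uint64_t schedule(uint64_t data) {
        if (children.empty())
            return rewrite(data);            // near atomic at the leaf
        for (Node& child : children)
            data = child.schedule(data);
        return data;
    }
};

int main() {
    // home agent -> two L2 nodes -> four L1 leaves each, as in FIG. 5
    Node home{{ Node{{{}, {}, {}, {}}}, Node{{{}, {}, {}, {}}} }};
    uint64_t final_value = home.schedule(0); // then stored to shared memory
    std::cout << final_value << '\n';        // 8 rewrites in this toy run
}
```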
  • In summary, the data access method provided by the embodiment of the present application carries out E-state scheduling and transfer management from the bottom up through the coherence management nodes, sets different processing times for different levels according to their requirements, and lets the E state stay longer within a NUMA domain so that more atomic instructions can complete there. After a CPU obtains the E state, it performs near atomic operations at the cache levels inside the CPU core, reducing the latency for the CPU to complete atomic operations, avoiding the entry-queue congestion caused by atomic instructions queuing at the common interleaving node, reducing the conflict rate and system overhead of atomic operations, and improving the throughput of atomic operations.
  • FIG. 6A shows a schematic diagram of a chip structure.
  • The chip includes a multi-core CPU, dedicated cache nodes of the multi-core CPU, shared cache nodes, and a second node.
  • the multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6, and CPU7 in FIG. 6A;
  • The dedicated cache nodes of the multi-core CPU include, for example, cache node 0, cache node 1, cache node 2, cache node 3, cache node 4, cache node 5, cache node 6, and cache node 7 in FIG. 6A.
  • The first node is the coherence management node of cache node 0, cache node 1, cache node 2, and cache node 3, which together form cluster 0.
  • The third node is the coherence management node of cache node 4, cache node 5, cache node 6, and cache node 7, which together form cluster 1.
  • The first node, the second node, and the third node belong to the same die, and the second node is the coherence node of the first node and the third node.
  • That is, the first node can be understood as an L2 Cache, the multiple cache nodes can be understood as L1 Caches, and the second node can be understood as the home agent; the third node is located in the same cache layer as the first node. Each cache node corresponds to one CPU and is that CPU's dedicated cache.
  • The first node in FIG. 6A can be used to perform steps 401, 402, and 403 above, that is, the relevant method steps for the case where cache node 0 through cache node 7 are in the L1 cache layer, the first node and the third node are in the L2 cache layer, and the second node is the home agent, and/or other processes of the techniques described herein.
  • FIG. 6B shows a schematic diagram of another chip structure.
  • The chip includes a multi-core CPU, dedicated cache nodes of the multi-core CPU, shared cache nodes, and a first node.
  • the multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6, and CPU7 in FIG. 6B;
  • the dedicated cache nodes of the multi-core CPU include, for example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6 and L1_7.
  • Cache node 0 is the coherence node of L1_0, L1_1, L1_2, and L1_3, and forms cluster 0 with them; cache node 1 is the coherence node of L1_4, L1_5, L1_6, and L1_7, and forms cluster 1 with them.
  • The first node is the coherence node of cache node 0 and cache node 1.
  • That is, the first node can be understood as the home agent and the multiple cache nodes can be understood as L2 Caches; each cache node is also used to manage the dedicated L1 Caches of multiple CPUs.
  • The first node in FIG. 6B can be used to perform the method steps of steps 401, 402, and 403 above for the case where the first node is the home agent and cache node 0 and cache node 1 are L2 Caches, and/or other processes of the techniques described herein.
  • An embodiment of the present application also provides a computer-readable storage medium in which computer program code is stored; when a processor executes the computer program code, the communication apparatus performs the data access method of the above embodiments.
  • Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to execute the above related steps so as to realize the data access method performed by the communication apparatus in the above embodiments.
  • In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods may be implemented in other ways.
  • The apparatus embodiments described above are only illustrative.
  • The division into modules or units is only a division by logical function; in actual implementation there may be other ways of dividing them.
  • For example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not implemented.
  • The mutual coupling, direct coupling, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
  • A unit described as a separate component may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units, that is, located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium.
  • Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.


Abstract

The embodiments of the present application relate to the technical field of chips. Provided are a data access method and apparatus, which can reduce delay in the completion of atomic computation by a CPU, and reduce a collision rate of atomic operations and system overheads due to lock contending in a multi-core CPU, thereby improving the throughput of atomic operations. The method comprises: a first node receiving a plurality of first read requests sent by a plurality of cache nodes of a first cache layer, wherein the plurality of first read requests are each used for requesting a computation permission of a first address, and the first node is used for managing the coherence of the plurality of cache nodes; according to the order of first read requests respectively first sent by the plurality of cache nodes, determining the order in which the plurality of cache nodes acquire the computation permission; and when the computation permission is acquired, according to the order in which the plurality of cache nodes acquire the computation permission, controlling the computation permission to be transferred between the plurality of cache nodes. The embodiments of the present application are used for executing a data read-modify-write operation in a cache on the basis of cache coherence.

Description

Data access method and apparatus

Technical Field
本申请实施例涉及芯片技术领域,尤其涉及一种访问数据的方法和装置。The embodiments of the present application relate to the field of chip technology, and in particular, to a method and device for accessing data.
背景技术Background technique
现代计算机系统和多核芯片在硬件上都支持共享内存(shared memory),即该共享内存可以被多个中央处理器(Central Processing Unit,CPU)访问,以作为软件进程间共享和传递数据的一个媒介,可以提高进程间的通信效率。为了保证多个CPU对共享内存的同一内存地址进行读改写操作后,软件最终能得到正确的执行结果,提出了各类存储一致性模型(memory consistency model),即多个CPU需遵从一定的访问顺序规则去读改写共享内存,可获得正确的执行结果,反之,则执行结果的正确性不受保证。由于不同的存储一致性模型所定义的读写规则不同,CPU会乱序执行没有依赖关系的指令以实现更多性能,多个线程间也允许交织运行以提高吞吐量,为了保证多线程交织运行过程中读改写操作的执行顺序,提出了同步机制(synchronizat ion)。在同步机制中,对一个共享变量进行读改写操作的原子访问行为是通过原子操作(atomic operation)完成的,对于实现一系列指令的原子访问行为是通过用锁(l ock)和临界区(critical section)的方式完成的。不论是原子操作还是锁和临界区,在硬件底层逻辑中,都是通过原子指令(atomic instruction)去完成对共享变量的读改写操作。Modern computer systems and multi-core chips support shared memory (shared memory) in hardware, that is, the shared memory can be accessed by multiple central processing units (Central Processing Unit, CPU) as a medium for sharing and transferring data between software processes , can improve the communication efficiency between processes. In order to ensure that after multiple CPUs read, modify and write the same memory address of the shared memory, the software can finally get the correct execution result, and various memory consistency models (memory consistency models) are proposed, that is, multiple CPUs need to follow certain access rules. Sequence rules to read and rewrite shared memory can obtain correct execution results, otherwise, the correctness of execution results is not guaranteed. Due to the different read and write rules defined by different storage consistency models, the CPU will execute instructions without dependencies in order to achieve more performance, and multiple threads are also allowed to interleave to improve throughput. In order to ensure multi-thread interleaved operation The execution sequence of read, modify and write operations in the process proposes a synchronization mechanism (synchronization). In the synchronization mechanism, the atomic access behavior of reading, modifying and writing operations on a shared variable is completed through atomic operations, and the atomic access behavior for implementing a series of instructions is achieved by using locks and critical sections. section) is completed. Whether it is an atomic operation or a lock and a critical section, in the underlying logic of the hardware, the read and write operation of the shared variable is completed through the atomic instruction (atomic instruction).
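As a concrete illustration of the two forms of synchronization described above, the following C++ sketch shows a read-modify-write performed as a single atomic operation, and the same update protected by a lock and critical section. The counter variables and function names are illustrative assumptions, not part of the embodiments; at the hardware level, both forms ultimately rely on atomic instructions.

```cpp
#include <atomic>
#include <mutex>

std::atomic<long> counter{0};   // shared variable
long plain_counter = 0;
std::mutex m;

// Read-modify-write as one atomic operation: indivisible, no other
// thread's access can interleave with it.
void increment_atomic() {
    counter.fetch_add(1);
}

// The same update protected by a lock and critical section: only the
// thread holding the lock may touch the shared variable.
void increment_locked() {
    std::lock_guard<std::mutex> guard(m);  // enter critical section
    plain_counter += 1;
}                                          // lock released on scope exit
```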
At present, computer systems and multi-core chips add a high-speed cache memory (cache) to the memory hierarchy to reduce the latency of accessing memory. Usually, multiple levels of cache are introduced into the system, known as a cache hierarchy or multi-level cache. To guarantee the correctness of the data read when different CPUs access the same cache line address at the same time, multiple CPUs must support cache coherence when performing read-modify-write operations on the same shared variable at the same time. Cache coherence is usually implemented based on the Modified-Exclusive-Shared-Invalid (MESI) coherence protocol. In the MESI protocol, before a CPU rewrites a shared variable it must first obtain the Exclusive (E) state of that variable, that is, ownership of the shared variable and therefore the permission to compute on the shared-memory address where the variable resides; consequently, a CPU completing a read-modify-write operation on a shared variable through an atomic instruction must also first obtain the E state of the variable. However, when multiple CPUs read-modify-write the same shared variable at the same time, the E state is contended by multiple CPUs and migrates frequently, a situation known as ownership migration. Ownership migration causes large system overhead, resulting in poor throughput when a multi-core CPU executes atomic instructions.
Summary of the invention
Embodiments of the present application provide a method and device for accessing data, which can improve the throughput of atomic operations executed by a multi-core CPU in a multi-level cache system architecture.
To achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, an embodiment of the present application provides a method for accessing data. The method includes: a first node receives multiple first read requests sent by multiple cache nodes of a first cache layer, where each first read request is used to request the computation permission of a first address, and the first node is used to manage the coherence of the multiple cache nodes; the first node determines the order in which the multiple cache nodes acquire the computation permission according to the order of the first of the first read requests sent by each cache node; and, when the first node acquires the computation permission, it controls the transfer of the computation permission among the multiple cache nodes according to the order in which they acquire it.
In this way, when multiple CPUs need to perform read-modify-write operations on the same address at the same time, the first node schedules and manages the transfer of the computation permission for that address from the bottom up (the first node receives the first read requests of the multiple cache nodes and can be regarded as the next level below the first cache layer), so that the computation permission is transferred among the cache nodes. Near-atomic computation can thus be performed at the cache levels inside the CPU core, reducing the latency for a CPU to complete an atomic computation. This also avoids the ingress-queue congestion caused in the prior art by atomic instructions queuing at a common interleaving node, reduces the conflict rate of atomic operations and the system overhead caused by lock contention among multi-core CPUs, and improves the throughput of atomic operations.
In one possible design, when the first node acquires the computation permission, controlling the transfer of the computation permission among the multiple cache nodes according to the order of the first of the first read requests sent by each cache node includes: when the first node acquires the computation permission, if a first cache node is the first of the multiple cache nodes to acquire the computation permission, the first node sends the computation permission to the first cache node; the first node then obtains first data from the first cache node, where the first data is the first cache node's computation result for the first address, and sends the first data and the computation permission to a second cache node, the second cache node being the second of the multiple cache nodes to acquire the computation permission.
In this way, when the first cache node obtains the computation permission for the first address, it can process its pending first read requests without interference from the other cache nodes, and the other cache nodes do not need to perform lock-grabbing operations. When the CPU corresponding to the first cache node has performed its computation on the first address at the first cache node and obtained a computation result, the first cache node's computation result for the first address and the computation permission can be sent to the second cache node, so that the CPU of the second cache node can perform further computation on the first address. This ensures that the data obtained by each cache node is always the latest computation result for the first address, guaranteeing cache coherence.
In one possible design, the first node obtaining the first data from the first cache node includes: the first node obtains the first data from the first cache node when it determines that the time for which the first cache node has held the computation permission reaches a first time period, where the first time period is determined by the first node according to the level at which the first node is located.
Setting the first time period thus ensures that each cache node, after obtaining the computation permission, can use it to process multiple first read requests, avoiding the frequent migration caused by contention for the computation permission, while also ensuring fairness among the multiple cache nodes computing on the first address.
In one possible design, the first cache layer is a level-1 (L1) cache layer, the multiple cache nodes respectively correspond to the private caches of multiple CPUs, the first node is a node of the level-2 (L2) cache layer and is a cache shared by the multiple CPUs corresponding to the multiple cache nodes, and the multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain. In this way, for nodes within the same NUMA domain, the first node decides the transfer of the computation permission among the multiple cache nodes of the first cache layer, which guarantees the cache coherence of those cache nodes.
In one possible design, the first node acquiring the computation permission includes: the first node sends a second read request to a second node, where the second read request is used to request the computation permission of the first address, the second node is used to manage the coherence of multiple first nodes, and the second node and the multiple first nodes it manages belong to the same die; the first node receives a first read response sent by the second node, where the first read response includes the computation permission. In this way, for nodes within the same die, the second node decides the transfer of the computation permission among the multiple first nodes, which guarantees the cache coherence of those first nodes.
In one possible design, when the second node determines that the time for which the first node has held the computation permission reaches a second time period, the second node obtains second data from the first node, where the second data is the latest computation result for the first address obtained by the first node, and sends the second data and the computation permission to a third node, the third node being a node in the same cache layer as the first node and belonging to the same die. It can be understood that the second node is the coherence node of the first node and the third node. The first node and the third node do not need to perform lock-grabbing operations to compute on the first address; the transfer of the computation permission is controlled by the second node.
In one possible design, the first cache layer is the level-2 (L2) cache layer, the multiple cache nodes are caches shared by multiple CPUs, and the first node is the home agent of the cache, used to perform read and write operations on memory. For example, when the first node and the third node are the L2_0 and L2_1 nodes of the L2 cache layer and the second node is the home agent, the home agent, while holding the computation permission for the first address, can control the transfer of the computation permission between the L2_0 node and the L2_1 node, guaranteeing the coherence of the L2 cache layer.
In a second aspect, an embodiment of the present application provides a communication apparatus. The communication apparatus includes a first node and multiple cache nodes of a first cache layer. The first node is configured to: receive multiple first read requests sent by the multiple cache nodes, where each first read request is used to request the computation permission of a first address, and the first node is used to manage the coherence of the multiple cache nodes; determine the order in which the multiple cache nodes acquire the computation permission according to the order of the first of the first read requests sent by each cache node; and, when the computation permission is acquired, control the transfer of the computation permission among the multiple cache nodes according to that order. For the beneficial effects achieved by the second aspect, refer to the beneficial effects of the first aspect.
In one possible design, the first node is specifically configured to: when acquiring the computation permission, if a first cache node is the first of the multiple cache nodes to acquire the computation permission, send the computation permission to the first cache node; obtain first data from the first cache node, where the first data is the first cache node's computation result for the first address; and send the first data and the computation permission to a second cache node, the second cache node being the second of the multiple cache nodes to acquire the computation permission.
In one possible design, the first node is specifically configured to obtain the first data from the first cache node when determining that the time for which the first cache node has held the computation permission reaches a first time period, where the first time period is determined by the first node according to the level at which the first node is located.
In one possible design, the first cache layer is a level-1 (L1) cache layer, the multiple cache nodes respectively correspond to the private caches of multiple CPUs, the first node is a node of the level-2 (L2) cache layer and is a cache shared by the multiple CPUs corresponding to the multiple cache nodes, and the multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain.
In one possible design, the first node is specifically configured to: send a second read request to a second node, where the second read request is used to request the computation permission of the first address, the second node is used to manage the coherence of multiple first nodes, and the second node and the multiple first nodes it manages belong to the same die; and receive a first read response sent by the second node, where the first read response includes the computation permission.
In one possible design, when the second node determines that the time for which the first node has held the computation permission reaches a second time period, the second node obtains second data from the first node, where the second data is the latest computation result for the first address obtained by the first node, and sends the second data and the computation permission to a third node, the third node being a node in the same cache layer as the first node and belonging to the same die.
In one possible design, the first cache layer is the level-2 (L2) cache layer, the multiple cache nodes are caches shared by multiple CPUs, and the first node is the home agent of the cache, used to perform read and write operations on memory.
In a third aspect, a computer-readable storage medium is provided, including computer instructions which, when run on a communication apparatus, cause the communication apparatus to execute the method described in the first aspect or any possible design of the first aspect.
In a fourth aspect, a computer program product is provided which, when run on a computer, causes a communication apparatus to execute the method described in the first aspect or any possible design of the first aspect.
For the beneficial effects corresponding to the other aspects above, refer to the description of the beneficial effects of the method; details are not repeated here.
Brief description of the drawings
FIG. 1 is a schematic diagram of a method for accessing data in the prior art;
FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the hardware structure of a communication apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for accessing data provided by an embodiment of the present application;
FIG. 6A is a schematic structural diagram of a communication apparatus provided by an embodiment of the present application;
FIG. 6B is a schematic structural diagram of a communication apparatus provided by an embodiment of the present application.
Detailed description of the embodiments
For ease of understanding, explanations of some concepts related to the embodiments of the present application are given below by way of example for reference:
Atomic instruction: used in the synchronization mechanism to protect critical sections and to complete read-modify-write operations on shared variables.
Atomic operation: one operation or a series of operations that cannot be interrupted. In a single-core CPU, an operation that can be completed in one instruction can be regarded as an atomic operation. Atomic operations are not interleaved; once started, they run to completion without switching to another thread.
Lock: when multiple threads access shared memory, the memory must be locked to guarantee mutually exclusive access; only the thread that holds the lock on the shared memory can access it. This can be understood as follows: when multiple CPUs issue instructions to access the same address, they must contend for the lock, and the CPU that grabs the lock can perform computation on that address.
Critical section: shared memory cannot be accessed by multiple threads at the same time; when one thread has entered the critical section, other threads or processes must wait.
MESI coherence protocol: in MESI, each cache line has four states: the Modified (M) state, the Exclusive (E) state, the Shared (S) state, and the Invalid (I) state.
The M state means that the data in the cache line (that is, the variable in this application) has been modified and is inconsistent with the data in main memory; the data in the current cache line prevails. The data in the cache line must be written back to main memory at some future point (before other CPUs are allowed to read the corresponding data in main memory). After being written back to main memory, the cache line transitions to the E state.
The E state means that the data in the cache line is consistent with the data in main memory and exists only in this CPU's cache; that is, the processor core corresponding to this cache level holds the data exclusively, and the data is clean (unmodified). The line transitions to the S state whenever another CPU reads it, and to the M state when a CPU modifies its data.
The S state means that the data in the cache line is consistent with the data in main memory and exists in multiple cache lines; that is, multiple processor cores share the data. When a CPU modifies the data in the cache line, the copies of that line in other CPUs are invalidated and transition to the I state.
The I state means that the data in the cache line is invalid and unusable (another CPU may have modified the line).
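As an illustration of the four states, the following C++ sketch encodes the transitions just described for a single cache line. The enum names and the simplified event model are assumptions made for illustration; a real coherence controller also exchanges snoop and invalidate messages between caches.

```cpp
// Illustrative only: the MESI states and the transitions described above,
// seen from one cache line's point of view.
enum class MesiState { Modified, Exclusive, Shared, Invalid };
enum class Event { LocalWrite, WriteBack, RemoteRead, RemoteWrite };

MesiState next_state(MesiState s, Event e) {
    switch (e) {
        case Event::LocalWrite:  // write after the E state is granted: -> M
            return MesiState::Modified;
        case Event::WriteBack:   // M -> E once the data reaches main memory
            return MesiState::Exclusive;
        case Event::RemoteRead:  // another CPU reads the line: E -> S
            return s == MesiState::Exclusive ? MesiState::Shared : s;
        case Event::RemoteWrite: // another CPU modifies the line: -> I
            return MesiState::Invalid;
    }
    return s;
}
```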
Non-uniform memory access (NUMA): a given storage region manages a portion of the address space. In the embodiments of the present application, a cluster can be understood as a NUMA domain.
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "multiple" means two or more.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise specified, "multiple" means two or more.
A CPU completes atomic instructions at different cache levels with different latencies. Atomic instructions completed in the CPU's internal caches may be called near-atomic instructions (near atomic), and atomic instructions completed in caches on the SoC side may be called far-atomic instructions (far atomic). Under the definition of multi-level caches, when the CPU internal cache includes, for example, the level-1 cache (L1 Cache), the SoC-side cache may include the level-2 cache (L2 Cache), or the L2 Cache plus the level-3 cache (L3 Cache); or, when the CPU internal cache includes the L1 Cache and the L2 Cache, the SoC-side cache may include the L3 Cache. It can be understood that the latency for a CPU to complete a far-atomic instruction is longer than that for a near-atomic instruction. At present, to avoid contention for the E state among multi-core CPUs, as shown in FIG. 1, the atomic instructions sent by all CPU cores may be aggregated through the multi-level caches to a common interleaving node; the common interleaving node applies to the home agent for the E state of each variable that a CPU intends to rewrite, and then executes the atomic instructions sent by the CPU cores one by one in scheduling order. That is, the atomic instructions of the multi-core CPU are all completed as far-atomic operations at the common interleaving node, which incurs a large latency. Moreover, when multiple CPUs in the system issue atomic instructions accessing the same memory address at the same time, that is, when multiple CPUs need the same E state, a large number of atomic instructions converge at the common interleaving node and congest its ingress queue. Because the buffer of the common interleaving node can hold only a limited number of atomic instructions, once that limit is reached the remaining atomic instructions cannot be stored in the buffer, and the CPU must resend them. This causes channel back-pressure at the common interleaving node, reduces the throughput of atomic instructions, and slows down computation, thereby degrading system performance.
Atomic instructions are further divided into conditional atomic instructions and non-conditional atomic instructions. A conditional atomic instruction must first pass a condition check before the atomic computation is performed on the memory at the common interleaving node; the computation is performed only when the check succeeds. For example, in the atomic compare used by the CAS algorithm, the atomic computation is performed only when the compare value carried in the atomic instruction sent by the CPU is the same as the memory value fetched from memory (that is, the data read from the address of the atomic instruction has not been modified by another CPU). A non-conditional atomic instruction performs the atomic computation directly on the memory at the common interleaving node, without a preceding condition check; examples include Atomic Add and Atomic Swap.
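To make the distinction concrete, the following sketch contrasts the two kinds of atomic instruction using the standard C++ std::atomic interface; the variable names are illustrative assumptions, and which hardware instructions are emitted depends on the target architecture.

```cpp
#include <atomic>

std::atomic<int> shared{0};

// Non-conditional atomic: always performs the computation, with no prior
// condition check (corresponds to Atomic Add).
void unconditional_add() {
    shared.fetch_add(1);
}

// Conditional atomic: a CAS-style atomic compare. The write happens only
// if the memory value still equals `expected`, i.e. no other CPU has
// modified it since we read it; otherwise it fails and writes nothing.
bool conditional_update(int expected, int desired) {
    return shared.compare_exchange_strong(expected, desired);
}
```

In a mutual-exclusive-lock scenario, only the CPU whose compare succeeds completes its update; the calls that return false correspond to the failed lock grabs discussed below.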
In a scenario where multi-core CPUs execute conditional atomic instructions at the common interleaving node, for example a mutual exclusive lock scenario, only one CPU can grab the lock, and the remaining queued atomic instructions have in essence all failed to grab it. In this scenario, the queued atomic instructions not only congest the ingress queue; because the lock grab has failed, most of these conditional atomic instructions will fail their atomic compare and therefore cannot complete the atomic computation. Atomic instructions whose compare fails perform no atomic computation and have no effect on system operation, so queuing most of these failed instructions is unnecessary and merely lowers overall system throughput.
It can be understood that this problem of low overall system throughput is especially severe when multi-core CPUs perform atomic operations across NUMA domains, because maintaining cache coherence gives rise to ownership migration. In this scenario, ownership migration causes large system overhead; that is, cross-NUMA-domain lock-grabbing events occur frequently. In particular, the smaller the workload in the critical section, the shorter the time each CPU holds the lock: after performing atomic computation for a short time, the lock is grabbed by another CPU, so the system overhead caused by this frequent ownership migration grows accordingly. In addition, the overhead is especially severe when cross-die access is involved. The end result is low throughput when multi-core CPUs perform atomic operations simultaneously. Existing software techniques implement lock rotation across NUMA domains by confining lock grabbing to one NUMA domain at a time: multiple CPUs within the same NUMA domain contend for the lock, and after a period of time the lock is transferred to the next NUMA domain so that its CPUs contend in turn. However, multiple CPUs still contend for the lock, so ownership migration remains frequent, the system overhead remains large, and throughput remains low.
Therefore, the present application proposes a method for accessing data, which can be applied to a communication apparatus. The communication apparatus in this application can be understood as a chip, for example any general-purpose chip such as a consumer chip or an industrial chip. Considering the low throughput caused in the prior art by lock grabbing when multi-core CPUs execute atomic operations, in a multi-level cache system architecture, in the scenario where multiple CPU cores execute atomic operations on the same address at the same time and atomic-operation conflicts therefore occur, this application performs E-state scheduling and transfer management from the bottom up through coherence management nodes. After a CPU acquires the E state, it performs near-atomic computation at the cache levels inside the CPU core, reducing the latency for the CPU to complete the atomic computation. This avoids the ingress-queue congestion caused by atomic instructions queuing at the common interleaving node, reduces the conflict rate of atomic operations and the system overhead caused by lock contention among multi-core CPUs, and improves the throughput of atomic operations.
As shown in FIG. 2, the embodiments of the present application can be applied to a multi-level cache system architecture. A cache is a temporary store located between the CPU and memory. Usually, the cache is divided into an L1 Cache and an L2 Cache, and some CPUs also have an L3 Cache. When the CPU wants to read a piece of data, it first looks it up in the L1 Cache; if the data is not found, it looks it up in the L2 Cache; if the data is still not found, it can look it up in the L3 Cache or in memory.
FIG. 2 shows a typical multi-level cache system with two cache levels, L1 Cache and L2 Cache. L1_0, L1_1, L1_2, and L1_3 denote the private L1 Caches of CPU0, CPU1, CPU2, and CPU3 respectively, and L2_0 denotes the L2 Cache shared by the CPUs within cluster0 (shared L2 Cache). The coherence of the multiple L1 Caches within each cluster is managed by the next-level L2 Cache in the same cluster; that is, L1_0, L1_1, L1_2, L1_3, and L2_0 are in the same cluster0, and L2_0 manages the coherence of L1_0, L1_1, L1_2, and L1_3. The L2 Cache within each cluster has its coherence managed by the next-level home agent in the same die; for example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6, L1_7, L2_0, L2_1, and the home agent are in the same die, and the home agent manages the coherence of L2_0 and L2_1. Therefore, for an L1 Cache within a cluster, its coherence management node is the L2 Cache in the same cluster; for an L2 Cache, its coherence management node is the home agent in the same die.
The embodiments of the present application can be applied to a communication apparatus. FIG. 3 shows a schematic diagram of the hardware structure of a communication apparatus, which may include the chip in the embodiments of the present application; chip 300 in FIG. 3 is an example. The chip 300 may include a processor 301, a memory controller 302, a multi-level cache 303, and the like.
It can be understood that the structure illustrated in this embodiment of the present application does not constitute a specific limitation on the chip 300. In other embodiments of the present application, the chip 300 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 301 may include one or more processing units. For example, the processor 301 may include a graphics processing unit (GPU), a central processing unit (CPU), and/or a neural network processing unit (NPU). The different processing units may be independent components or may be integrated in one or more processors. In some embodiments, the chip 300 may include one or more processors 301, and multiple processors can be understood as a multi-core CPU.
The processor 301 may include a part of the multi-level cache 303 for storing instructions and data. This part of the multi-level cache 303 can be understood as the CPU internal cache. In some embodiments, the CPU internal cache may be a high-speed cache memory, for example the L1 cache described above. The L1 cache can hold instructions or data that the processor 301 has recently used or uses repeatedly; if the CPU needs the instructions or data again, it can fetch them directly from the L1 cache, which reduces CPU waiting time and improves system efficiency. In the embodiments of the present application, the CPU internal cache can also be understood as the L1 Cache and the L2 Cache; that is, the L1 Cache and L2 Cache are cache levels inside the CPU core and can be used by the CPU for near-atomic computation, where the L1 Cache is a private cache level inside the CPU and the L2 Cache is a shared cache level inside the CPU.
The processor 301 can be understood as the nerve center and command center of the chip 300. It can generate operation control signals according to instruction opcodes and timing signals to control instruction fetching and execution.
The memory controller 302 is used to manage data read and write operations in memory. The memory controller 302 may also include a home agent, which can be used to implement read and write operations on memory. In the embodiments of the present application, the home agent can be responsible for the cache coherence management of the L2 cache of the chip 300. It can be understood that the home agent is located outside the CPU core and can be used by the CPU for far-atomic computation. In the embodiments of the present application, the home agent can also provide a CPU with the E state when it accesses a memory address.
The remaining part of the multi-level cache 303 can be understood as the CPU external cache, that is, the cache levels on the SoC, such as the L3 cache.
Applying the chip provided above, the following describes, with reference to the accompanying drawings, the process in the data access method proposed in this application by which coherence management nodes perform E-state scheduling management from the bottom up in a scenario where the multi-core CPU of a communication apparatus, for example a chip, executes atomic operations on the same memory address at the same time.
As shown in FIG. 4, an embodiment of the present application provides a method for accessing data, taking as an example a multi-level cache system architecture with two cache levels (L1 Cache and L2 Cache) and one home agent level, where the L1 Cache and L2 Cache are cache levels inside the CPU core. The method includes:
Step 401: The first node receives multiple first read requests sent by multiple cache nodes of the first cache layer.
In some embodiments, the multiple first read requests are all used to request the computation permission (E state) of a first address; a node can perform a rewrite operation on the data at the first address only after it has obtained the computation permission of that address. It can be understood that rewriting the data at the same memory address requires the same computation permission, while rewriting data at different memory addresses requires different computation-permission states.
In this application, the first address being accessed by multiple CPUs can be understood as the first address undergoing read-modify-write operations by multiple CPUs.
In the embodiments of the present application, the first node is used to manage the coherence of the multiple cache nodes of the first cache layer; that is, the first node can control the transfer of the computation permission among the multiple cache nodes of the first cache layer. Within any period of time, only one CPU can hold the computation permission of the first address; in other words, within a period of time only one CPU is allowed to rewrite the data at the first address across the multiple cache nodes of the first cache layer, thereby guaranteeing cache coherence among the multiple cache nodes.
Exemplarily, the first node may be the home agent or an L2 Cache.
When the first node is the home agent, the first cache layer may be the L2 Cache, and the multiple cache nodes may correspond to L2_0 and L2_1 in FIG. 2 respectively. In this scenario, step 401 can be understood as the home agent receiving multiple first read requests sent by multiple L2 Caches. In this case, when an L2 cache determines that it stores the data of the first address locally, the first read request it sends to the home agent is used to request only the computation permission of the first address; when an L2 cache determines that it does not store the data of the first address locally, the first read request it sends to the home agent is used to request both the data of the first address and the computation permission of the first address. Before the L2 cache sends the first read request to the home agent, it may itself receive read requests from multiple L1 caches requesting the data and computation permission of the first address; the L2 cache then sends the first read request to the home agent to request the computation permission of the first address. It should be understood that L2_0 or L2_1 may issue one or more read requests to the home agent, all requesting the computation permission of the same memory address.
When the first node is an L2 Cache, the first cache layer is the L1 Cache. Taking the first node as L2_0 in FIG. 2 as an example, the multiple cache nodes may include L1_0, L1_1, L1_2, and L1_3 in FIG. 2. In this scenario, step 401 can be understood as the L2 Cache (L2_0) receiving multiple first read requests sent by multiple L1 Caches (L1_0, L1_1, L1_2, and L1_3). In this case, the multiple L1 caches determine that they do not store the data of the first address locally (the initial state of an L1 cache line is the Invalid state), so the first read request an L1 cache sends to the L2 cache is used to request both the data of the first address and the computation permission of the first address. Before the L1 cache sends the first read request to the L2 cache, it receives a read request from its CPU requesting the data and computation permission of the first address; the L1 cache then sends the first read request to the L2 cache to request the data and computation permission of the first address. It should be understood that L1_0, L1_1, L1_2, and L1_3 may each issue one or more first read requests to the L2 Cache, all requesting the data and computation permission of the same memory address.
In some embodiments, a first read request may also be used to request both the data of the first address and the computation permission. Exemplarily, when the first node is the home agent, if neither the L1 cache nor the L2 cache stores the data of the first address, the first read request that the home agent receives from the L2 cache requests both the data of the first address and the computation permission. If the L2 cache stores the data of the first address, the L2 cache no longer needs to request the data from the home agent and only requests the computation permission of the first address.
The difference from the case where the first node is the home agent is that, when the first node is an L2 Cache, the first cache layer is the L1 Cache, whose initial state is the I state; that is, the data in the L1 Cache is invalid and unusable. Therefore, the cache nodes of the L1 Cache issue first read requests to the L2 cache, each requesting both the data of the first address and the computation permission. After receiving the first read requests sent by the L1 Caches, the L2 cache decides, according to whether it itself stores the data of the first address, whether the read request it sends to the home agent requests both the data of the first address and the computation permission, or only the computation permission.
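The following minimal sketch, under assumed types and field names (the embodiments do not specify a request format), captures the rule just described: a node requests the data plus the computation permission when it does not hold the line locally, and the permission only when it does.

```cpp
#include <cstdint>

// Illustrative request format; the actual on-chip message layout is not
// defined by the embodiments.
enum class ReadKind { DataAndPermission, PermissionOnly };

struct ReadRequest {
    uint64_t address;  // the first address
    ReadKind kind;
};

ReadRequest make_read_request(uint64_t addr, bool line_cached_locally) {
    // L1 caches start in the Invalid state, so they always take the
    // DataAndPermission branch; an L2 cache takes either branch depending
    // on whether it already stores the data of the first address.
    return ReadRequest{addr, line_cached_locally
                                 ? ReadKind::PermissionOnly
                                 : ReadKind::DataAndPermission};
}
```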
Step 402: The first node determines the order in which the multiple cache nodes acquire the computation permission according to the order of the first of the first read requests sent by each of the multiple cache nodes.
Each cache node may send one or more first read requests to the first node. Determining the acquisition order according to the order of the first of the first read requests sent by each cache node can be understood as queuing the multiple cache nodes by the time at which each sent its first first read request.
Exemplarily, when the first node is an L2 Cache, taking the first node as L2_0 in FIG. 2 as an example, the multiple cache nodes may correspond to L1_0, L1_1, L1_2, and L1_3 in FIG. 2 respectively. Suppose L1_0 first sends read request 1 to L2_0, then L1_2 sends read request 2 to L2_0, then L1_0 sends read request 3 to L2_0, and finally L1_1 sends read request 4 to L2_0. L1_0 has sent two read requests and is queued by the time of its first one, that is, by the time it sent read request 1. L2_0 therefore determines that the order in which these three cache nodes acquire the computation permission is L1_0, L1_2, L1_1. Since L1_3 sent no read request, L1_3 is not in the order.
Similarly, when the first node is the home agent, upon acquiring the computation permission the home agent determines the order in which the multiple cache nodes of the L2 Cache acquire the computation permission according to the order of the first of the first read requests sent by each of them.
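A hedged sketch of this ordering rule follows: requesters are queued by the arrival of their first read request, and later requests from the same node do not change its position. The container choices are illustrative assumptions, not the embodiments' actual implementation.

```cpp
#include <deque>
#include <unordered_set>

struct GrantQueue {
    std::deque<int> order;             // node ids, in grant order
    std::unordered_set<int> enqueued;  // nodes already queued

    void on_read_request(int node_id) {
        // Only a node's *first* read request determines its position.
        if (enqueued.insert(node_id).second)
            order.push_back(node_id);
    }
};
```

Fed the requests from the example above (L1_0, then L1_2, then L1_0 again, then L1_1), the queue holds L1_0, L1_2, L1_1, matching the order derived in the text.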
Step 403: When the first node acquires the computation permission, it controls the transfer of the computation permission among the multiple cache nodes according to the order in which the multiple cache nodes acquire the computation permission.
Exemplarily, when the first node is an L2 Cache and acquires the computation permission, it controls the transfer of the computation permission among the multiple cache nodes of the L1 Cache according to the order of acquisition determined in step 402. Similarly, when the first node is the home agent and acquires the computation permission, it controls the transfer of the computation permission among the multiple cache nodes of the L2 Cache according to the order determined in step 402.
In some embodiments, when the first node (for example, L2_0 in FIG. 2) acquires the computation permission, it sends the computation permission to the first cache node (for example, L1_0 in FIG. 2); after the first cache node obtains the computation permission, the CPU corresponding to the first cache node uses it to compute in the first cache node.
The first cache node can be understood as the first cache node in the order of the first of the first read requests sent by the multiple cache nodes, that is, the first of the multiple cache nodes to acquire the computation permission.
Exemplarily, when the first node acquires the computation permission, it sends the computation permission to the first cache node, and the first cache node processes its one or more first read requests in the order in which it sent them to the first node. Processing a first read request can be understood as performing a rewrite operation on the data at the first address, that is, completing an atomic computation. Completing a rewrite of the data at the first address produces new data at the first address, namely the first cache node's computation result for the first address.
In some embodiments, the first node obtains the first data from the first cache node and sends the first data and the computation permission to the second cache node (for example, L1_1 in FIG. 2).
The first data is the first cache node's computation result for the first address; this result can be understood as the latest computation result obtained after the first cache node has processed one or more first read requests. The second cache node is the second of the multiple cache nodes to acquire the computation permission.
Exemplarily, the first node obtains from the first cache node the latest computation result after the first cache node has processed its one or more first read requests, that is, the first data, and sends the first data and the computation permission to the second cache node in the order of the first of the first read requests, that is, to the second cache node. After obtaining the latest computation result for the first address and the computation permission, the second cache node likewise processes its one or more first read requests in the order in which it sent them to the first node, rewriting the latest computation result of the first address and updating it.
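The transfer described in this step can be pictured with the following sketch: the first node grants the permission to the head of the queue, lets the holder drain its pending requests, and then forwards the newest result together with the permission to the next node in order. The CacheNode type and its pending-request model are illustrative assumptions.

```cpp
#include <cstdint>
#include <deque>

// Illustrative stand-in for a cache node: `pending` models the number of
// queued read-modify-write requests it completes while holding the E state.
struct CacheNode {
    int id;
    int pending;
    uint64_t compute(uint64_t latest) {
        // e.g. suppose each queued request performs an atomic add of 1
        return latest + static_cast<uint64_t>(pending);
    }
};

// Step 403 as a loop: grant to the head of the grant order, recall the
// newest result once the dwell time (the first time period) has elapsed,
// and forward the data together with the permission to the next node.
uint64_t transfer_permission(std::deque<CacheNode>& grant_order,
                             uint64_t data) {
    while (!grant_order.empty()) {
        CacheNode holder = grant_order.front();
        grant_order.pop_front();
        data = holder.compute(data);  // holder computes while it has E state
    }
    return data;  // latest computation result for the first address
}
```

Combined with the GrantQueue sketch above, this loop reflects the bottom-up scheduling: the order comes from step 402, and the dwell time comes from the first time period described next.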
In some embodiments, when the first node determines that the time for which the first cache node has held the operation permission reaches a first time period, the first node obtains the first data from the first cache node.
Illustratively, when the time for which the first cache node has held the operation permission reaches the first time period, i.e., reaches the time limit, the first node obtains from the first cache node the latest operation result after the first cache node has finished rewriting the data at the first address, namely the first data, and sends the first data and the operation permission to the second cache node, thereby transferring the operation permission.
The first time period is set so that the operation permission can linger for a while at whichever cache node it is transferred to, allowing that node to process multiple first read requests after obtaining the permission rather than passing it on after a single request; this avoids the permission migrating away from the cache node too frequently. At the same time, the first time period ensures that every cache node that has sent a first read request is able to obtain the operation permission, guaranteeing fairness among the cache nodes; for this reason the time each cache node may hold the permission must be bounded.
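A minimal sketch of how such a bound might be enforced in the toy model (the deadline check and drain_for are assumptions of this illustration, not the disclosed hardware mechanism):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical dwell-time variant of drain(): the node keeps the
// permission for at most first_period, batching several queued requests,
// after which the permission is taken back and transferred onward.
uint64_t drain_for(CacheNode& node, uint64_t data,
                   std::chrono::microseconds first_period) {
    const auto deadline = std::chrono::steady_clock::now() + first_period;
    while (!node.pending.empty() &&
           std::chrono::steady_clock::now() < deadline) {
        data = node.pending.front().modify(data);
        node.pending.pop_front();
    }
    return data;  // permission moves to the next cache node in the order
}
```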
Different levels manage coherence domains of different sizes, i.e., different numbers of CPUs. For example, L2_0 and L2_1 in FIG. 2 each manage the coherence of 4 CPUs, while the home agent manages the coherence of L2_0 and L2_1 and thus covers 8 CPUs. The delay applied to the operation permission at the home agent therefore needs to exceed the delay applied at L2_0 or L2_1; that is, the dwell time of the operation permission at the home agent must exceed its dwell time at L2_0 or L2_1, so that an L2 cache can operate normally after obtaining the permission. This amounts to delaying on demand: delaying uniformly according to the maximum CPU count would waste the system's dynamic performance.
Therefore, the first time period may be determined by the first node according to the level at which the first node is located.
This can also be understood as the first node determining, from the number of CPUs it covers, how long the operation permission should dwell at each cache node; or as the first node adaptively determining that dwell time from both its own level and the number of upstream cache nodes that have sent first read requests.
Illustratively, when the first node is the home agent, taking FIG. 2 as an example, the home agent can determine from its own level (it covers 8 CPUs) how long the operation permission should dwell at L2_0 and L2_1, so that within the dwell time granted by the home agent the read requests issued by those 8 CPUs are handled fairly. The home agent may also combine its own level (8 CPUs) with the number of upstream cache nodes that have sent first read requests, i.e., the number of CPUs issuing read requests through the L1 caches (FIG. 2 shows 8 L1 caches, i.e., 8 CPUs), to adaptively determine the dwell time of the operation permission at L2_0 and L2_1, ensuring that within that dwell time the read requests issued by the requesting CPUs are handled fairly.
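One possible shape of such an on-demand rule, assuming a fixed per-request cost (the formula, the dwell_time name and all parameters are assumptions of this illustration; the disclosure only requires that higher levels, which cover more CPUs, grant longer dwell times):

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical per-level dwell-time rule: a node covering more CPUs
// (e.g. a home agent over 8 CPUs vs. an L2 over 4) grants a longer
// dwell time, capped by the number of active requesters so the window
// is no larger than needed.
std::chrono::microseconds dwell_time(unsigned cpus_managed,
                                     unsigned active_requesters,
                                     std::chrono::microseconds per_request) {
    unsigned budget = std::max(1u, std::min(cpus_managed, active_requesters));
    return per_request * budget;
}
```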
The following description takes as an example the case where the first node is a shared cache of the multiple CPUs corresponding to the multiple cache nodes, and the multiple cache nodes are the private caches of those CPUs. That is, the first node is a node of the level-2 cache layer (L2 cache), and the first cache layer is the level-1 cache layer (L1 cache).
Here, the multiple cache nodes and the first node belong to the same non-uniform memory access (NUMA) domain; equivalently, the multiple cache nodes and the first node belong to the same cluster.
Illustratively, taking FIG. 2 as an example, when the first node is L2_0 and the multiple cache nodes are L1_0, L1_1, L1_2 and L1_3, then L1_0, L1_1, L1_2, L1_3 and L2_0 belong to the same cluster0. When the first node is L2_1 and the multiple cache nodes are L1_4, L1_5, L1_6 and L1_7, then L1_4, L1_5, L1_6, L1_7 and L2_1 belong to the same cluster1.
In some embodiments, upon receiving the first read requests from the multiple cache nodes of the first cache layer, the first node (for example, L2_0) sends a second read request to a second node (for example, the home agent), and the first node receives a first read response sent by the second node, the first read response including the operation permission.
Here, the second node is configured to manage the coherence of multiple first nodes, and the second node and the multiple first nodes it manages belong to the same die. When the second node is the home agent, this can be understood as the home agent managing the coherence of multiple L2 caches; taking FIG. 2 as an example, the home agent, L2_0, L2_1, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6 and L1_7 belong to the same die.
The second read request requests the operation permission of the first address and can be understood as the read request an L2 cache sends to the home agent. When the L2 cache already holds the data at the first address, the second read request requests only the operation permission; when the L2 cache does not hold that data, the second read request also requests the data at the first address.
Illustratively, taking FIG. 2 as an example, L2_0 sends a second read request to the home agent to request the operation permission of the first address, the first address being the address requested by the L1 cache in its read request to the L2 cache. That is, the L1 cache sends the L2 cache a first read request for the operation permission of the first address; after receiving it, the L2 cache in turn sends the home agent a second read request for that permission. After receiving the second read request, the home agent sends the L2 cache a first read response carrying the operation permission of the first address.
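A sketch of the L2-side decision (L2Node and request_from_home are names invented for this illustration; a real L2 would issue a coherence-protocol message, not a function call):

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Stub standing in for the home agent: in a real system this is the
// coherence directory granting the data and the E state.
std::pair<uint64_t, bool> request_from_home(uint64_t addr) {
    (void)addr;
    return {0 /* data */, true /* permission granted */};
}

struct L2Node {
    std::optional<uint64_t> cached;  // data of the first address, if held
    bool has_permission = false;     // E state for the first address

    // Serve the L1's request locally when data and permission are already
    // cached; otherwise escalate with a second read request.
    uint64_t acquire(uint64_t addr) {
        if (!cached || !has_permission) {
            auto [data, granted] = request_from_home(addr);
            cached = data;
            has_permission = granted;
        }
        return *cached;
    }
};
```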
In some embodiments, when the second node determines that the time for which the first node has held the operation permission reaches a second time period, the second node obtains second data from the first node and sends the second data and the operation permission to a third node (for example, L2_1).
Here, the second data is the latest operation result for the first address obtained by the first node. Since the L2 cache rotates the operation permission through multiple L1 caches, and each L1 cache updates the data at the first address when it uses the permission to process its one or more first read requests, the second data can be understood as the latest operation result for the first address that the L2 cache retrieves, once the rotation has finished, from the last L1 cache to hold the permission.
The third node is a node in the same cache layer as the first node and on the same die. Taking FIG. 2 as an example, when the first node is L2_0, the third node may be L2_1.
Illustratively, when the home agent determines that the time for which L2_0 has held the operation permission reaches the second time period, it obtains the latest operation result for the first address from L2_0 and sends that result together with the operation permission to L2_1.
In some embodiments, the first read requests sent by the multiple cache nodes of the first cache layer may be computed directly at the first node.
Illustratively, the home agent sends the operation permission to an L2 cache, and the first read requests sent by the L1 caches to that L2 cache can then be computed directly at the L2 cache.
Whether the data at the first address is operated on at the L1 cache or at the L2 cache, the operation is a near-atomic operation, whose latency is far smaller than that of a far-atomic operation.
FIG. 5 is a flowchart of a method for accessing data provided by an embodiment of this application. When an L1 cache sends the L2 cache a first read request (denoted rd in FIG. 5) for the data and operation permission (denoted E in FIG. 5) of the first address, the L2 cache determines the order in which the multiple L1 caches obtain the operation permission according to the order of the initial first read requests sent by those L1 caches. If the L2 cache holds neither the data nor the operation permission of the first address, it sends the home agent a second read request (denoted Rd in FIG. 5) for them, and the home agent determines the order in which the multiple L2 caches obtain the operation permission according to the order of the initial second read requests sent by those L2 caches.
After the home agent obtains the data and operation permission of the first address, it sends them to the first L2 cache in the order (assume L2_0). After receiving them, L2_0 sends the data and operation permission to the first L1 cache in its own order (assume L1_0) and, upon determining that L1_0's holding time has reached the first time period, retrieves the latest operation result for the first address from L1_0. L2_0 then sends that latest result and the operation permission to the second L1 cache in its order (assume L1_1), which rewrites the latest operation result for the first address. This continues until the last L1 cache to obtain the permission (assume L1_3) finishes rewriting the latest result; since L2_0 no longer transfers the permission onward, L1_3 returns its latest operation result and the permission (denoted ACK in FIG. 5) to L2_0. At this point the home agent determines that L2_0's holding time has reached the second time period, retrieves the latest operation result for the first address from L2_0, and sends it together with the operation permission to the second L2 cache in its order (assume L2_1), which rewrites the latest operation result for the first address in the same way. When the last L2 cache to obtain the permission (assume L2_1) finishes rewriting, the home agent no longer transfers the permission onward, so L2_1 returns its latest operation result and the permission to the home agent. The home agent stores the finally received latest operation result for the first address into shared memory, completing one round of operation-permission scheduling.
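In the toy model above, one full round of this two-level scheduling could be written as follows (schedule_round is a hypothetical name; the clusters vector stands in for the home agent's ordered list of L2 caches, each holding its ordered L1 nodes):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical end-to-end round mirroring FIG. 5: the home agent grants
// the permission to each L2 in order; each L2 rotates it through its L1
// nodes; the final result returns to the home agent for write-back.
uint64_t schedule_round(std::vector<std::vector<CacheNode*>>& clusters,
                        uint64_t data) {
    for (auto& l1_order : clusters) {                // e.g. L2_0 then L2_1
        data = schedule_permission(l1_order, data);  // L1-level rotation
    }
    return data;  // home agent stores this result into shared memory
}
```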
Thus, in the data access method provided by the embodiments of this application, coherence management nodes perform E-state scheduling and transfer management level by level from the bottom up, setting different processing times for different levels according to their needs. The E state is kept within a NUMA domain for longer so that more atomic instructions complete there, and after a CPU obtains the E state it performs near-atomic operations within the CPU core's cache hierarchy, reducing the latency of completing atomic operations. This prevents atomic instructions from queuing at common interleaving nodes and congesting ingress queues, lowers the conflict rate and system overhead of atomic operations, and raises the throughput of atomic operations.
FIG. 6A is a schematic diagram of a chip structure. The chip includes a multi-core CPU, private cache nodes of the multi-core CPU, shared cache nodes, and a second node. The multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6 and CPU7 in FIG. 6A; the private cache nodes of the multi-core CPU include, for example, cache node 0 through cache node 7 in FIG. 6A. The first node is the coherence management node of cache nodes 0, 1, 2 and 3, which together form cluster 0. The third node is the coherence management node of cache nodes 4, 5, 6 and 7, which together form cluster 1. The second node and the third node belong to the same die, and the second node is the coherence node of the first node and the third node.
In the chip shown in FIG. 6A, the first node can be understood as an L2 cache and the multiple cache nodes as L1 caches; the second node can then be understood as the home agent, and the third node as another L2 cache at the same cache level as the first node. Each cache node corresponds to one CPU and serves as that CPU's private cache. The first node in FIG. 6A may be used to carry out the method steps of steps 401, 402 and 403 above in the case where cache nodes 0 through 7 are at the L1 cache layer, the first and third nodes are at the L2 cache layer, and the second node is the home agent, and/or other processes of the techniques described herein.
FIG. 6B is a schematic diagram of another chip structure. The chip includes a multi-core CPU, private cache nodes of the multi-core CPU, shared cache nodes, and a second node. The multi-core CPU includes, for example, CPU0, CPU1, CPU2, CPU3, CPU4, CPU5, CPU6 and CPU7 in FIG. 6B; the private cache nodes of the multi-core CPU include, for example, L1_0, L1_1, L1_2, L1_3, L1_4, L1_5, L1_6 and L1_7 in FIG. 6B. Cache node 0 is the coherence node of L1_0, L1_1, L1_2 and L1_3, which form cluster 0; cache node 1 is the coherence node of L1_4, L1_5, L1_6 and L1_7, which form cluster 1. The first node is the coherence node of cache node 0 and cache node 1.
In the chip shown in FIG. 6B, the first node can be understood as the home agent and the multiple cache nodes as L2 caches, each cache node further controlling the private L1 caches of multiple CPUs. The first node in FIG. 6B may be used to carry out the method steps of steps 401, 402 and 403 above in the case where the first node is the home agent and cache nodes 0 and 1 are L2 caches, and/or other processes of the techniques described herein. An embodiment of this application further provides a computer-readable storage medium storing computer program code; when a processor executes the computer program code, a communication apparatus performs the data access method of the above embodiments.
An embodiment of this application further provides a computer program product which, when run on a computer, causes the computer to perform the above related steps so as to implement the data access method performed by the communication apparatus in the above embodiments.
From the description of the above implementations, a person skilled in the art will understand that, for convenience and brevity of description, only the division into the above functional modules is used as an example; in practical applications, the above functions may be allocated to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules or units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Moreover, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may be one physical unit or multiple physical units, i.e., located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (15)

  1. A method for accessing data, characterized in that the method comprises:
    receiving, by a first node, multiple first read requests sent by multiple cache nodes of a first cache layer, each of the multiple first read requests requesting an operation permission of a first address, wherein the first node is configured to manage coherence of the multiple cache nodes;
    determining, by the first node, an order in which the multiple cache nodes obtain the operation permission according to an order of the first one of the first read requests sent by each of the multiple cache nodes;
    when the first node has obtained the operation permission, controlling, by the first node, transfer of the operation permission among the multiple cache nodes according to the order in which the multiple cache nodes obtain the operation permission.
  2. The method according to claim 1, characterized in that, when the first node has obtained the operation permission, controlling the transfer of the operation permission among the multiple cache nodes according to the order of the first one of the first read requests sent by each of the multiple cache nodes comprises:
    when the first node has obtained the operation permission, if a first cache node is the first one of the multiple cache nodes to obtain the operation permission, sending, by the first node, the operation permission to the first cache node;
    obtaining, by the first node, first data from the first cache node, the first data being an operation result of the first cache node for the first address, and sending the first data and the operation permission to a second cache node, the second cache node being the second one of the multiple cache nodes to obtain the operation permission.
  3. The method according to claim 2, characterized in that obtaining, by the first node, the first data from the first cache node comprises:
    obtaining, by the first node, the first data from the first cache node upon determining that the time for which the first cache node has held the operation permission reaches a first time period;
    wherein the first time period is determined by the first node according to the level at which the first node is located.
  4. The method according to any one of claims 1 to 3, characterized in that the first cache layer is a level-1 (L1) cache layer, and the multiple cache nodes respectively correspond to private caches of multiple central processing units (CPUs);
    the first node is a node of a level-2 (L2) cache layer and is a shared cache of the multiple CPUs corresponding to the multiple cache nodes;
    the multiple cache nodes and the first node belong to a same non-uniform memory access (NUMA) domain.
  5. The method according to claim 4, characterized in that the first node obtaining the operation permission comprises:
    sending, by the first node, a second read request to a second node, the second read request requesting the operation permission of the first address, wherein the second node is configured to manage coherence of multiple first nodes, and the second node and the multiple first nodes managed by the second node belong to a same die;
    receiving, by the first node, a first read response sent by the second node, the first read response comprising the operation permission.
  6. The method according to claim 5, characterized in that the method further comprises:
    when the second node determines that the time for which the first node has held the operation permission reaches a second time period, obtaining, by the second node, second data from the first node, the second data being the latest operation result of the first address obtained by the first node, and sending, by the second node, the second data and the operation permission to a third node, the third node being a node in the same cache layer as the first node and belonging to the same die.
  7. The method according to any one of claims 1 to 3, characterized in that the first cache layer is a level-2 (L2) cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs;
    the first node is a home agent of the caches, configured to perform read and write operations on the memory.
  8. A communication apparatus, characterized in that the communication apparatus comprises a first node and multiple cache nodes of a first cache layer, the first node being configured to:
    receive multiple first read requests sent by the multiple cache nodes, each of the multiple first read requests requesting an operation permission of a first address, wherein the first node is configured to manage coherence of the multiple cache nodes;
    determine an order in which the multiple cache nodes obtain the operation permission according to an order of the first one of the first read requests sent by each of the multiple cache nodes;
    when the operation permission has been obtained, control transfer of the operation permission among the multiple cache nodes according to the order in which the multiple cache nodes obtain the operation permission.
  9. The communication apparatus according to claim 8, characterized in that the first node is specifically configured to:
    when the operation permission has been obtained, if a first cache node is the first one of the multiple cache nodes to obtain the operation permission, send the operation permission to the first cache node;
    obtain first data from the first cache node, the first data being an operation result of the first cache node for the first address, and send the first data and the operation permission to a second cache node, the second cache node being the second one of the multiple cache nodes to obtain the operation permission.
  10. The communication apparatus according to claim 9, characterized in that the first node is specifically configured to:
    obtain the first data from the first cache node upon determining that the time for which the first cache node has held the operation permission reaches a first time period;
    wherein the first time period is determined by the first node according to the level at which the first node is located.
  11. The communication apparatus according to any one of claims 8 to 10, characterized in that the first cache layer is a level-1 (L1) cache layer, and the multiple cache nodes respectively correspond to private caches of multiple central processing units (CPUs);
    the first node is a node of a level-2 (L2) cache layer and is a shared cache of the multiple CPUs corresponding to the multiple cache nodes;
    the multiple cache nodes and the first node belong to a same non-uniform memory access (NUMA) domain.
  12. The communication apparatus according to claim 11, characterized in that the first node is specifically configured to:
    send a second read request to a second node, the second read request requesting the operation permission of the first address, wherein the second node is configured to manage coherence of multiple first nodes, and the second node and the multiple first nodes managed by the second node belong to a same die;
    receive a first read response sent by the second node, the first read response comprising the operation permission.
  13. The communication apparatus according to claim 12, characterized in that:
    when the second node determines that the time for which the first node has held the operation permission reaches a second time period, the second node obtains second data from the first node, the second data being the latest operation result of the first address obtained by the first node, and the second node sends the second data and the operation permission to a third node, the third node being a node in the same cache layer as the first node and belonging to the same die.
  14. The communication apparatus according to any one of claims 8 to 10, characterized in that the first cache layer is a level-2 (L2) cache layer, and the multiple cache nodes are respectively shared caches of multiple CPUs;
    the first node is a home agent of the caches, configured to perform read and write operations on the memory.
  15. A computer-readable storage medium, characterized by comprising computer instructions which, when run on a communication apparatus, cause the communication apparatus to perform the method according to any one of claims 1 to 7.
PCT/CN2021/096550 2021-05-27 2021-05-27 Data access method and apparatus WO2022246769A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/096550 WO2022246769A1 (en) 2021-05-27 2021-05-27 Data access method and apparatus
CN202180086851.0A CN116685958A (en) 2021-05-27 2021-05-27 Method and device for accessing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/096550 WO2022246769A1 (en) 2021-05-27 2021-05-27 Data access method and apparatus

Publications (1)

Publication Number Publication Date
WO2022246769A1 true WO2022246769A1 (en) 2022-12-01

Family

ID=84229452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096550 WO2022246769A1 (en) 2021-05-27 2021-05-27 Data access method and apparatus

Country Status (2)

Country Link
CN (1) CN116685958A (en)
WO (1) WO2022246769A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030041215A1 (en) * 2001-08-27 2003-02-27 George Robert T. Method and apparatus for the utilization of distributed caches
CN101030171A (en) * 2006-02-28 2007-09-05 国际商业机器公司 Data processing system, cache system and method for reducing imprecise invalid coherency states
CN102819420A (en) * 2012-07-31 2012-12-12 中国人民解放军国防科学技术大学 Command cancel-based cache production line lock-step concurrent execution method
CN108257078A (en) * 2016-12-28 2018-07-06 英特尔公司 Memory knows the source of reordering

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030041215A1 (en) * 2001-08-27 2003-02-27 George Robert T. Method and apparatus for the utilization of distributed caches
CN101030171A (en) * 2006-02-28 2007-09-05 国际商业机器公司 Data processing system, cache system and method for reducing imprecise invalid coherency states
CN102819420A (en) * 2012-07-31 2012-12-12 中国人民解放军国防科学技术大学 Command cancel-based cache production line lock-step concurrent execution method
CN108257078A (en) * 2016-12-28 2018-07-06 英特尔公司 Memory knows the source of reordering

Also Published As

Publication number Publication date
CN116685958A (en) 2023-09-01

Similar Documents

Publication Publication Date Title
US11334262B2 (en) On-chip atomic transaction engine
US10210092B1 (en) Managing cache access and streaming data
US20240086065A1 (en) Delayed snoop for improved multi-process false sharing parallel thread performance
US7934061B2 (en) Methods and arrangements to manage on-chip memory to reduce memory latency
US8904154B2 (en) Execution migration
US5692149A (en) Block replacement method in cache only memory architecture multiprocessor
US10162757B2 (en) Proactive cache coherence
JP4566264B2 (en) Method, system, apparatus, and program for performing cache line polling by cross-referencing with related applications using store and reserve instructions
US20140006716A1 (en) Data control using last accessor information
US20090083496A1 (en) Method for Improved Performance With New Buffers on NUMA Systems
JP5752918B2 (en) Multiprocessor and cache coherency management apparatus and method thereof
WO2022246769A1 (en) Data access method and apparatus
JPH052534A (en) Hierarchical cache memory device
CN111414318B (en) Data consistency implementation method based on advanced updating
JPH06309231A (en) Cache memory control method
WO2024140543A1 (en) Cc-numa server, lock request processing method, and related apparatus
US11847061B2 (en) Approach for supporting memory-centric operations on cached data
US20230315636A1 (en) Multiprocessor system cache management with non-authority designation
CN116680229A (en) Operation method of distributed shared memory protocol
US8230173B2 (en) Cache memory system, data processing apparatus, and storage apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942345

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180086851.0

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21942345

Country of ref document: EP

Kind code of ref document: A1