WO2024051292A1 - Data processing system, memory mirroring method, apparatus and computing device - Google Patents

Data processing system, memory mirroring method, apparatus and computing device

Info

Publication number
WO2024051292A1
Authority
WO
WIPO (PCT)
Prior art keywords
area
node
memory
data
mirror
Prior art date
Application number
PCT/CN2023/102963
Other languages
English (en)
French (fr)
Inventor
陈智勇
孙宏伟
潘伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024051292A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation

Definitions

  • the present application relates to the field of computers, and in particular, to a data processing system, a memory mirroring method, a device and a computing device.
  • Memory mirroring is an effective means to solve memory uncorrectable errors (UCE), that is, a part of the storage space in the memory is used as a mirror area of another part of the storage space to store backup data.
  • Usually, memory mirroring is implemented using static configuration or by the operating system allocating adjacent pages in memory as the mirror area. If the mirror area is too large, memory storage resources are wasted. If the mirror area is too small, UCEs in memory cannot be resolved. Therefore, the current memory mirroring configuration is inflexible, resulting in low utilization of storage resources.
  • This application provides a data processing system, a memory mirroring method, a device, and a computing device, thereby enabling flexible configuration of memory mirroring and improving the utilization of memory storage resources.
  • In a first aspect, a data processing system is provided. The data processing system includes multiple nodes and a management node.
  • the first node is used to request that the first area in the memory used by the first node be mirrored;
  • the management node is used to allocate a second area, that is, the first area is the area to be mirrored, and the second area is the mirror area of the first area;
  • the second area is used to indicate the storage space in the second node that is the same size as the first area, and the second area is used to back up and store the data of the first area.
  • in the solution provided by this application, when a node has not requested memory mirroring, the storage resources in the system are used to store other data.
  • only when a memory mirroring requirement is raised is the mirror area allocated from the system's storage resources, so that the mirror area backs up and stores the data of the area to be mirrored, improving data reliability.
  • the solution provided by this application does not limit the positional relationship between the area to be mirrored and the mirror area.
  • the area to be mirrored and the mirror area can be storage spaces in different nodes, so that the mirror area is allocated flexibly and dynamically to implement memory mirroring, improving the flexibility of memory mirroring configuration and the utilization of storage resources.
  • the first node indicates the first physical address of the first area; the management node is also used to generate a mirror relationship between the first area and the second area, where the mirror relationship is used to indicate the correspondence between the first physical address and a second physical address, and the second physical address is used to indicate the second area. Therefore, when the first node performs a read operation or a write operation on the first area, it is convenient for the management node to determine the mirror area of the first area according to the mirror relationship and perform the write operation on the mirror area of the first area, or, when an uncorrectable error occurs in the first area, to read the first data from the second area, avoiding data processing failure.
  • the management node is further configured to receive a write instruction sent by the first node, and write the first data into the first area and the second area.
  • the write instruction is used to instruct the first data to be stored in the first area.
  • the management node is also configured to receive a read instruction from the first node, where the read instruction is used to instruct reading of the first data from the first area; the management node is also configured to read the first data from the first area when no uncorrectable error occurs in the first area.
  • the management node is also used to read the first data from the second area when an uncorrectable error occurs in the first area, so that the first node successfully reads the first data and the business that requires the first data is not affected.
  • the first area is the main storage space and the second area is the backup storage space; the management node is also used to determine the second area as the main storage space when an uncorrectable error occurs in the first area.
  • the management node is also used to instruct the first node to modify the mirror identifier of the first area to invalid, which makes it convenient for the node to release the storage resources of the first area and improves the utilization rate of the storage resources.
  • the size of the first area is determined by application requirements.
  • the second area includes any one of the local storage space of the second node, the extended storage space of the second node, and the storage space of the second node in the global memory pool.
  • the management node supports a cache coherence protocol.
  • In a second aspect, a memory mirroring method is provided, including: the first node requests that the first area in the memory used by the first node be mirrored; the management node allocates a second area.
  • the second area is a mirror area of the first area, the second area is used to indicate the storage space in the second node that is the same size as the first area, and the second area is used to back up and store data in the first area.
  • the first node indicates the first physical address of the first area; the method further includes: the management node generates a mirror relationship between the first area and the second area, where the mirror relationship is used to indicate the correspondence between the first physical address and a second physical address, and the second physical address is used to indicate the second area.
  • the method further includes: the management node receives a write instruction sent by the first node, where the write instruction is used to instruct the first data to be stored in the first area; the management node writes the first data into the first area and the second area.
  • the method further includes: the management node receives a read instruction from the first node, where the read instruction is used to instruct reading of the first data from the first area; when no uncorrectable error occurs in the first area, the management node reads the first data from the first area.
  • the method further includes: when an uncorrectable error occurs in the first area, the management node reads the first data from the second area.
  • the first area is the main storage space and the second area is the backup storage space; the method also includes: when an uncorrectable error occurs in the first area, the management node determines the second area as the main storage space.
  • the method further includes: the management node instructs the first node to modify the mirror identification of the first area to be invalid.
  • the size of the first area is determined by application requirements.
  • the second area includes any one of the local storage space of the second node, the extended storage space of the second node, and the storage space of the second node in the global memory pool.
  • the management node supports a cache coherence protocol.
  • In a third aspect, a management device is provided. The device includes various modules for executing the method executed by the management node in the second aspect or any possible design of the second aspect.
  • a fourth aspect provides a data processing node, which includes various modules for executing the method performed by the node in the second aspect or any possible design of the second aspect.
  • In a fifth aspect, a computing device is provided, including at least one processor and a memory, where the memory is used to store a set of computer instructions; when the processor, acting as the management node in the second aspect or any possible implementation of the second aspect, executes the set of computer instructions, it performs the operating steps of the memory mirroring method in the second aspect or any possible implementation of the second aspect.
  • In another aspect, a chip is provided, including a processor and a power supply circuit, where the power supply circuit is used to supply power to the processor, and the processor is used to perform the operating steps of the memory mirroring method in the second aspect or any possible implementation of the second aspect.
  • In another aspect, a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions are run in a computing device, the computing device is caused to perform the steps of the method in the second aspect or any possible implementation of the second aspect.
  • In another aspect, a computer program product is provided. When the computer program product is run on a computer, it causes the computing device to perform the operating steps of the method described in the second aspect or any possible implementation of the second aspect.
  • Figure 1 is a schematic architectural diagram of a data processing system provided by this application.
  • Figure 2 is a schematic diagram of a deployment scenario of a global memory pool provided by this application.
  • Figure 3 is a schematic flow chart of a memory mirroring method provided by this application.
  • Figure 4 is a schematic flow chart of a data processing method provided by this application.
  • FIG. 5 is a schematic structural diagram of a management device provided by this application.
  • Figure 6 is a schematic structural diagram of a computing device provided by this application.
  • Memory is also called internal memory or main memory. Memory is an important component of the computer system, serving as the bridge between external memory (or auxiliary memory) and the central processing unit (CPU). Memory is used to temporarily store operation data in the CPU and data exchanged between the CPU and external memories such as hard disks. For example, when the computer starts running, the data that needs to be calculated is loaded from the memory into the CPU for calculation. After the calculation is completed, the CPU stores the calculation results in the memory.
  • An uncorrectable error (UCE) refers to a memory error that exceeds the error correction capability of error correction code (ECC) technology and therefore cannot be corrected using ECC. If the storage space in the memory where an uncorrectable error occurs has a mirror area configured, the backup data of that storage space can be obtained from the mirror area.
  • Global Mirror refers to using half of the storage space in the memory as a mirror area for the other half of the storage space, which is used to back up and store the data stored in the other half of the storage space.
  • Partial mirroring also called memory address mirroring based on address ranges, refers to using half of the storage space indicated by an address segment in the memory as a mirror area for the other half.
  • Cacheline refers to the unit in which a computer device performs read or write operations on the memory storage space.
  • the size of a cache line can be 64 bytes (byte, B).
  • Interleaving refers to evenly distributing the data accessed to the memory to multiple memory channels according to the unit storage space (for example, cache line).
  • the interleaving method can be configured by the system administrator and can be interleaved between multiple memory channels connected to a processor, or between multiple memory channels on multiple processors.
  • Memory channel refers to multiple memories connected to the processor in a computer device.
  • the processor can use interleaving technology to operate on memory. For example, the processor evenly distributes the data to be written to memory across multiple memory channels based on the size of the cache line. In turn, the processor reads data from multiple memory channels based on the size of the cache line. Therefore, data processing is performed based on multiple memory channels to improve the memory bandwidth utilization and processing performance of the computer device.
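  • As an illustration of the interleaving described above, the following minimal C sketch (hypothetical, not part of this application) shows how consecutive cache-line-sized blocks of a physical address range are spread evenly across several memory channels; the channel count, the 64-byte cache line size and the function name are assumptions made only for illustration.

        #include <stdint.h>
        #include <stdio.h>

        #define CACHELINE_SIZE 64u   /* assumed interleaving granularity, in bytes */
        #define NUM_CHANNELS   4u    /* assumed number of memory channels          */

        /* Map a physical address to the memory channel that holds its cache line. */
        static unsigned channel_of(uint64_t phys_addr)
        {
            return (unsigned)((phys_addr / CACHELINE_SIZE) % NUM_CHANNELS);
        }

        int main(void)
        {
            /* Consecutive cache lines land on channels 0, 1, 2, 3, 0, 1, ... */
            for (uint64_t addr = 0; addr < 8 * CACHELINE_SIZE; addr += CACHELINE_SIZE)
                printf("address 0x%llx -> channel %u\n",
                       (unsigned long long)addr, channel_of(addr));
            return 0;
        }
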
  • Super Node refers to interconnecting multiple nodes into a high-performance cluster through high-bandwidth, low-latency inter-chip interconnect buses and switches.
  • the scale of the supernode is larger than the node scale under the Cache-Coherent Non-Uniform Memory Access (CC-NUMA) architecture, and the interconnection bandwidth of the nodes within the supernode is larger than the Ethernet interconnection bandwidth.
  • High Performance Computing (HPC) cluster refers to a computer cluster system.
  • HPC clusters contain multiple computers connected together using various interconnect technologies.
  • the interconnection technology may be, for example, InfiniBand (IB), Remote Direct Memory Access over Converged Ethernet (RoCE), or the Transmission Control Protocol (TCP).
  • HPC provides ultra-high floating-point computing capabilities and can be used to solve the computing needs of computing-intensive and massive data processing services.
  • the combined computing power of multiple computers connected together can handle large computing problems.
  • For example, industries such as scientific research, weather forecasting, finance, simulation experiments, biopharmaceuticals, gene sequencing, and image processing use HPC clusters to solve large-scale computing problems and meet their computing needs. Using HPC clusters to handle large-scale computing problems can effectively shorten the computing time for processing data and improve computing accuracy.
  • Memory operation instructions can be called memory semantics or memory operation functions.
  • Memory operation instructions include at least one of memory allocation (malloc), memory set (memset), memory copy (memcpy), memory move (memmove), memory release (memory release) and memory comparison (memcmp).
  • Memory allocation is used to allocate a section of memory to support application running.
  • Memory settings are used to set the data mode of the global memory pool, such as initialization.
  • Memory copy is used to copy the data stored in the storage space indicated by the source address (source) to the storage space indicated by the destination address (destination).
  • Memory movement is used to copy the data stored in the storage space indicated by the source address (source) to the storage space indicated by the destination address (destination), and delete the data stored in the storage space indicated by the source address (source).
  • Memory comparison is used to compare whether the data stored in two storage spaces are equal.
  • Memory release is used to release data stored in memory to improve the utilization of system memory resources and thereby improve system performance.
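  • To make the memory operation instructions listed above concrete, the following small C sketch exercises allocation, set, copy, move, compare and release using the standard C library equivalents (malloc, memset, memcpy, memmove, memcmp, free); the memory interfaces of the global memory pool described in this application may differ, so this is only an analogy.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            char *src = malloc(16);              /* memory allocation              */
            char *dst = malloc(16);
            if (!src || !dst) return 1;

            memset(src, 0, 16);                  /* memory set: initialize to zero */
            strcpy(src, "mirror-me");
            memcpy(dst, src, 16);                /* memory copy: src -> dst        */

            /* memory compare: 0 means the two areas hold identical data */
            printf("equal: %s\n", memcmp(src, dst, 16) == 0 ? "yes" : "no");

            /* memory move handles overlapping ranges; here it shifts data in place */
            memmove(src + 1, src, 10);

            free(src);                           /* memory release                 */
            free(dst);
            return 0;
        }
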
  • the data processing system includes multiple nodes and a management node.
  • the management node allocates the second area, that is, the first area is the area to be mirrored, the second area is the mirror area of the first area, the second area is used to indicate the storage space in the second node that is the same size as the first area, and the second area is used to back up and store data in the first area.
  • the memory mirroring method uses the storage resources in the system to store other data when there is no memory mirroring requirement.
  • only when a memory mirroring requirement is raised is the mirror area allocated from the system's storage resources, so that the mirror area backs up and stores the data of the area to be mirrored, improving data reliability.
  • the method of this application does not limit the positional relationship between the area to be mirrored and the mirror area.
  • the area to be mirrored and the mirror area can be storage spaces in different nodes, so that the mirror area is allocated flexibly and dynamically to implement memory mirroring, improving the flexibility of memory mirroring configuration and the utilization of storage resources.
  • FIG 1 is a schematic architectural diagram of a data processing system provided by this application.
  • data processing system 100 is an entity that provides high performance computing.
  • Data processing system 100 includes a plurality of nodes 110 .
  • Nodes 110 may include compute nodes and storage nodes.
  • the node 110 may be a processor, a server, a desktop computer, a smart network card, a memory expansion card, a controller and a memory of a storage array, etc.
  • the processor can be a central processing unit (CPU), a graphics processing unit (GPU), a data processing unit (DPU), a neural-network processing unit (NPU), an embedded processor, or another XPU used for data processing.
  • When the node 110 is an XPU with high computing power used for data processing, such as a GPU, DPU or NPU, the node 110 can be used as an accelerator, offloading tasks from the general-purpose processor (such as the CPU) to the accelerator. The accelerator processes jobs with high computing demand (such as HPC, big data jobs, database jobs, etc.), solving the problem that the insufficient floating-point computing power of general-purpose processors cannot meet the heavy floating-point computing needs of HPC, artificial intelligence (AI) and other scenarios, thereby shortening data processing time, reducing system energy consumption and improving system performance.
  • the computing power of a node can also be called the computing power of the node.
  • accelerators may also be integrated inside node 110 . Independently deployed accelerators and nodes integrating accelerators support flexible plugging and unplugging, and can flexibly expand the scale of the data processing system on demand to meet the computing needs of different application scenarios.
  • a storage node includes one or more controllers, network cards, and multiple hard disks.
  • Hard drives are used to store data.
  • the hard disk can be a magnetic disk or other type of storage medium, such as a solid state drive or a shingled magnetic recording hard drive.
  • Network cards are used to communicate with the computing nodes contained in the computing cluster.
  • the controller is used to write data to the hard disk or read data from the hard disk according to the read/write data request sent by the computing node. In the process of reading and writing data, the controller needs to convert the address carried in the read/write data request into an address that the hard disk can recognize.
  • a management node 120 (eg, a switch) connects multiple nodes 110 based on high-speed interconnection links.
  • the management node 120 connects multiple nodes 110 through optical fiber, copper cable or copper wire.
  • the management node can be called a switching chip or an interconnect chip or a Baseboard Management Controller (BMC).
  • the data processing system 100 composed of multiple nodes 110 connected by the management node 120 based on high-speed interconnection links may also be called a super node.
  • Multiple supernodes are connected through a data center network.
  • the data center network includes multiple core switches and multiple aggregation switches.
  • Data center networks can form a scale domain.
  • Multiple supernodes can form a performance domain.
  • Two or more super nodes can form a macro cabinet. Macro cabinets can also be connected based on the data center network.
  • the management node 120 is configured to allocate a mirror area with the same size as the area to be mirrored to the area to be mirrored in the memory used by the node 110 according to the memory mirroring requirement issued by the node 110 .
  • the management node 120 can support Compute Express Link (CXL) and other cache coherence protocols to maintain the high performance, low latency and data consistency of memory mirroring.
  • multiple nodes 110 are directly connected based on high-speed interconnection links with high bandwidth and low latency.
  • the node 110 has the function of the management node 120 provided by this application.
  • the data processing system 100 supports running big data, database, high-performance computing, artificial intelligence, distributed storage, cloud native and other applications.
  • the data that needs to be backed up and stored in the embodiments of this application includes virtual machines (VM), containers, high availability (HA) applications, and the business data of applications such as big data, databases, high-performance computing, artificial intelligence (AI), distributed storage and cloud native applications.
  • the area to be mirrored and the mirroring area may be storage spaces in different nodes.
  • the mirror area can be provided by the local storage medium, extended storage medium or global memory pool of any node 110 in the system.
  • the storage media of the nodes 110 in the data processing system 100 are uniformly addressed to form a global memory pool, enabling memory semantic access across nodes within the supernode (referred to as: cross-node).
  • the global memory pool is a node-shared resource composed of the node's storage media through unified addressing.
  • the global memory pool provided by this application may include the storage medium of the computing node and the storage medium of the storage node in the super node.
  • the storage medium of the computing node includes at least one of a local storage medium within the computing node and an extended storage medium connected to the computing node.
  • the storage medium of the storage node includes at least one of a local storage medium within the storage node and an extended storage medium connected to the storage node.
  • the global memory pool includes local storage media within computing nodes and local storage media within storage nodes.
  • the global memory pool includes local storage media within the computing node, extended storage media connected to the computing node, and any one of local storage media within the storage node and extended storage media connected to the storage node.
  • the global memory pool includes local storage media within the computing node, extended storage media connected to the computing node, local storage media within the storage node, and extended storage media connected to the storage node.
  • the global memory pool 200 includes the storage medium 210 in each of the N computing nodes, the extended storage medium 220 connected to each of the N computing nodes, the storage medium 230 in each of the M storage nodes, and the extended storage medium 240 connected to each of the M storage nodes.
  • the storage capacity of the global memory pool may include part of the storage capacity in the storage medium of the computing node and part of the storage capacity in the storage medium of the storage node.
  • the global memory pool is a storage medium that can be accessed by both computing nodes and storage nodes in the supernode through unified addressing.
  • the storage capacity of the global memory pool can be used by computing nodes or storage nodes through memory interfaces such as large memory, distributed data structures, data caches, and metadata. Compute nodes running applications can use these memory interfaces to perform memory operations on the global memory pool.
  • the global memory pool constructed based on the storage capacity of the storage media of the computing nodes and the storage nodes provides a unified memory interface northbound for the computing nodes to use, allowing the computing nodes to use the unified memory interface to write data into the storage space provided by the computing node or the storage node in the global memory pool, realizing computation and storage of data based on memory operation instructions, reducing data processing latency and increasing data processing speed.
  • the above description takes the storage medium in the computing node and the storage medium in the storage node to construct a global memory pool as an example.
  • the deployment method of the global memory pool can be flexible and changeable, and is not limited in the embodiments of this application.
  • the global memory pool is built from the storage media of the storage nodes.
  • the global memory pool is constructed from the storage media of computing nodes. Using the storage media of separate storage nodes or the storage media of computing nodes to build a global memory pool can reduce the occupation of storage resources on the storage side and provide a more flexible expansion solution.
  • the storage media of the global memory pool provided by the embodiments of this application include dynamic random access memory (DRAM), solid state drives (SSD) and storage-class memory (SCM).
  • the global memory pool can be set according to the type of storage medium, that is, one type of storage medium is used to construct one memory pool, and different types of storage media construct different types of global memory pools, so that the computing node can select a storage medium based on the access characteristics of the application, which enhances the user's control over the system, improves the user's system experience, and expands the applicable application scenarios of the system.
  • the DRAM in the computing node and the DRAM in the storage node are uniformly addressed to form a DRAM memory pool.
  • the DRAM memory pool is used in application scenarios that require high access performance, moderate data capacity, and no data persistence requirements.
  • the SCM in the computing node and the SCM in the storage node are uniformly addressed to form an SCM memory pool.
  • the SCM memory pool is used in application scenarios that are not sensitive to access performance, have large data capacity, and require data persistence.
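  • The pool-selection rule in the two example scenarios above can be summarized in a short, hypothetical C sketch; the enum values and the decision criteria (persistence requirement and latency sensitivity) merely paraphrase those scenarios and are not an interface defined by this application.

        #include <stdbool.h>
        #include <stdio.h>

        typedef enum { POOL_DRAM, POOL_SCM } pool_type_t;

        /* Pick a global memory pool type from coarse application characteristics. */
        static pool_type_t select_pool(bool needs_persistence, bool latency_sensitive)
        {
            if (!needs_persistence && latency_sensitive)
                return POOL_DRAM;   /* high access performance, moderate capacity */
            return POOL_SCM;        /* large capacity, data persistence required  */
        }

        int main(void)
        {
            printf("in-memory cache  -> %s\n",
                   select_pool(false, true) == POOL_DRAM ? "DRAM pool" : "SCM pool");
            printf("persistent store -> %s\n",
                   select_pool(true, false) == POOL_DRAM ? "DRAM pool" : "SCM pool");
            return 0;
        }
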
  • FIG. 3 is a schematic flowchart of a memory mirroring method provided by this application.
  • The following takes node 110A requesting memory mirroring as an example.
  • the method includes the following steps.
  • Step 310 The node 110A sends the memory mirroring requirement to the management node 120.
  • the node 110A can send a memory mirroring requirement to the management node 120, requesting memory mirroring of the first area where data is stored, that is, requesting that the management node 120 allocate a second area that is the same size as the first area.
  • the first area is the area to be mirrored, and the second area is the mirror area of the first area.
  • the second area is used to indicate the storage space in the second node that is the same size as the first area.
  • the mirror area backs up and stores the data stored in the area to be mirrored.
  • Data that needs to be backed up can include virtual machines (VM), containers, high availability (HA) applications, and data indicated by business requirements.
  • A business requirement can indicate that important data during business execution needs to be backed up and stored. That is, the data that needs to be backed up is stored in both the area to be mirrored and the mirror area. If a fault occurs in the area to be mirrored, or an error occurs in the data stored in the area to be mirrored, the data can be obtained from the mirror area, thereby improving data reliability and preventing a storage space failure or data error from disrupting the business and affecting user experience.
  • the memory mirroring requirement may be sent to the management node 120 according to the mirroring policy.
  • The mirroring policy instructs that memory mirroring requirements be determined based on the application's reliability level. Reliability refers to the property of a product not malfunctioning during use; the higher the reliability of a product, the longer the product can work without failure. For example, the system administrator can pre-configure the reliability level of the application.
  • the node 110A sends the memory mirroring requirement according to the reliability level of the application: for an application with high reliability requirements, it applies to the management node 120 for memory mirroring; for an application with low reliability requirements, there is no need to apply to the management node 120 for memory mirroring.
  • Step 320 The management node 120 obtains the memory mirroring requirement.
  • the management node 120 may receive the memory mirroring requirement sent by the node 110A through the optical fiber connecting the node 110A.
  • the memory mirroring requirement is used to indicate the area to be mirrored in the memory used by node 110A.
  • the memory used by node 110A includes at least one of a local storage medium, an extended storage medium, and a global memory pool. Understandably, the area to be mirrored that the node 110A requests for memory mirroring can be the storage space in any one of the local storage media, the extended storage media, and the global memory pool of the node 110A.
  • the memory mirroring requirement specifically indicates the physical address of the area to be mirrored and the size of the area to be mirrored, so that the management node 120 can directly obtain the size of the area to be mirrored from the memory mirroring requirement.
  • the memory mirroring requirement includes a physical address segment of the area to be mirrored.
  • the management node 120 determines the size of the area to be mirrored based on the physical address segment.
  • the memory mirroring requirement includes the physical address and offset address of the area to be mirrored.
  • the management node 120 determines the size of the area to be mirrored based on the physical address and offset address of the area to be mirrored.
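  • The three ways of describing the area to be mirrored above (physical address plus size, a physical address segment, or a physical address plus an offset address) all let the management node 120 derive the same two quantities. A hypothetical C sketch of such a requirement and of the size derivation follows; the struct and field names are illustrative assumptions, not an interface defined by this application.

        #include <stdint.h>
        #include <stdio.h>

        /* One possible encoding of a memory mirroring requirement. Exactly one of
         * the three variants is filled in; all of them identify the same region.  */
        struct mirror_request {
            int      kind;        /* 0: addr+size, 1: [start,end) segment, 2: addr+offset */
            uint64_t phys_addr;   /* start physical address of the area to be mirrored    */
            uint64_t size;        /* variant 0: size of the area in bytes                  */
            uint64_t end_addr;    /* variant 1: end of the physical address segment        */
            uint64_t offset;      /* variant 2: offset address relative to phys_addr       */
        };

        /* The management node derives the size of the area to be mirrored. */
        static uint64_t mirror_region_size(const struct mirror_request *req)
        {
            switch (req->kind) {
            case 0:  return req->size;
            case 1:  return req->end_addr - req->phys_addr;
            default: return req->offset;
            }
        }

        int main(void)
        {
            struct mirror_request r = { .kind = 1, .phys_addr = 0x1000, .end_addr = 0x3000 };
            printf("area to mirror: start 0x%llx, size %llu bytes\n",
                   (unsigned long long)r.phys_addr,
                   (unsigned long long)mirror_region_size(&r));
            return 0;
        }
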
  • Step 330 The management node 120 allocates a mirror area according to memory mirror requirements.
  • the management node 120 determines a free storage medium from the storage media it manages, and divides an area from the free storage medium that is the same size as the area to be mirrored as a mirror area.
  • the storage media managed by the management node 120 includes the local storage media of any node in the system, extended storage media, and storage media that constitute a global memory pool.
  • the storage medium to which the mirror area belongs can be any storage medium in the system, and the relationship between the storage medium to which the mirror area belongs and the storage medium to which the area to be mirrored belongs is not limited.
  • the free storage medium may be a storage medium that is far away from the storage medium to which the area to be mirrored belongs.
  • the storage medium to which the mirror area belongs and the storage medium to which the area to be mirrored belongs can be located in different computer rooms or different cabinets, so that the mirror area and the area to be mirrored are farther apart; that is, the mirror area is allocated from a storage medium different from the one to which the area to be mirrored belongs, so as to avoid deploying the mirror area and the area to be mirrored on the same storage medium where they could fail at the same time, thereby reducing the possibility of the mirror area and the area to be mirrored failing simultaneously and improving the reliability of memory mirroring.
  • the management node 120 divides an area with the same size as the area to be mirrored from the node 110B as a mirror area.
  • the node 110A and the node 110B may be two independent physical devices. The distance between the node 110A and the node 110B is relatively long. The node 110A and the node 110B may be located in different computer rooms or different cabinets.
  • the management node 120 can also determine the number of mirror areas to allocate according to the reliability level, that is, the management node 120 allocates different numbers of mirror areas for reliability levels from high to low, so as to achieve multi-copy backup for high-reliability data and ensure data reliability.
  • the reliability levels include reliability level 1 to reliability level 5 from low to high.
  • the management node 120 allocates a mirroring area according to the reliability level 1 indicated by the memory mirroring requirement.
  • the management node 120 allocates two mirroring areas according to the reliability level 2 indicated by the memory mirroring requirement.
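  • As a minimal sketch of the reliability-driven allocation just described, the following hypothetical C function maps a reliability level to a number of mirror areas (level 1 yields one mirror area, level 2 yields two, and so on); the exact mapping policy is an assumption and would in practice be chosen by the system administrator.

        #include <stdio.h>

        /* Hypothetical policy: the higher the reliability level (1..5), the more
         * mirror areas the management node allocates for the area to be mirrored. */
        static unsigned mirror_copies_for_level(unsigned reliability_level)
        {
            if (reliability_level < 1) reliability_level = 1;
            if (reliability_level > 5) reliability_level = 5;
            return reliability_level;   /* level 1 -> 1 mirror area, level 2 -> 2, ... */
        }

        int main(void)
        {
            for (unsigned level = 1; level <= 5; level++)
                printf("reliability level %u -> %u mirror area(s)\n",
                       level, mirror_copies_for_level(level));
            return 0;
        }
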
  • the storage medium includes any one of DRAM, SSD and SCM.
  • this application does not limit the size of the area to be mirrored, that is, it does not limit the memory mirroring granularity.
  • the management node 120 can perform memory mirroring on a storage area of any size, thereby improving storage resource utilization by performing memory mirroring according to memory mirroring requirements. This avoids statically configuring the mirror area, where a mirror area that is too large wastes memory storage resources and a mirror area that is too small cannot solve memory UCEs.
  • If the memory mirroring granularity is larger than the memory interleaving granularity, a fault in the mirror area will affect multiple interleaved accesses to the memory, reducing the utilization of storage resources.
  • the memory mirroring granularity can be 64 bytes, matching the memory interleaving granularity, thereby avoiding the additional memory waste caused by isolating an enlarged interleaved storage area.
  • the management node 120 may construct a mirroring relationship between the area to be mirrored and the mirroring area, so that the management node 120 determines the mirroring area according to the mirroring relationship and performs read operations or write operations on the mirroring area.
  • the mirroring relationship between the area to be mirrored and the mirroring area indicates the corresponding relationship between the physical address of the area to be mirrored and the physical address of the mirroring area.
  • the mirroring relationship can be presented in the form of a table, as shown in Table 1.
  • the physical address 1 of the area to be mirrored corresponds to the physical address 2 of the mirror area.
  • the management node 120 looks up the table according to physical address 1 of the area to be mirrored, determines that the physical address of the mirror area is physical address 2, and performs read or write operations on the mirror area according to physical address 2.
  • Table 1 only illustrates the storage form of the corresponding relationship in the storage device in the form of a table, and does not limit the storage form of the corresponding relationship in the storage device; the corresponding relationship can also be stored in other forms, which is not limited in this embodiment.
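  • A mirroring relationship of the kind shown in Table 1 can be held in a simple lookup structure. The following C sketch is hypothetical: it stores pairs of (physical address of the area to be mirrored, physical address of the mirror area) and resolves one from the other, which is the lookup the management node 120 performs before duplicating a write or redirecting a read; a real implementation would use whatever storage form the device actually adopts.

        #include <stdint.h>
        #include <stdio.h>

        #define MAX_MIRRORS 128

        struct mirror_entry {
            uint64_t src_pa;     /* physical address of the area to be mirrored */
            uint64_t mirror_pa;  /* physical address of the mirror area          */
        };

        static struct mirror_entry table[MAX_MIRRORS];
        static unsigned entries;

        static int add_mirror(uint64_t src_pa, uint64_t mirror_pa)
        {
            if (entries >= MAX_MIRRORS) return -1;
            table[entries].src_pa = src_pa;
            table[entries].mirror_pa = mirror_pa;
            entries++;
            return 0;
        }

        /* Return the mirror area's physical address, or 0 if no mirror exists. */
        static uint64_t lookup_mirror(uint64_t src_pa)
        {
            for (unsigned i = 0; i < entries; i++)
                if (table[i].src_pa == src_pa)
                    return table[i].mirror_pa;
            return 0;
        }

        int main(void)
        {
            add_mirror(0x1000, 0x20000);  /* "physical address 1" -> "physical address 2" */
            printf("mirror of 0x1000 is 0x%llx\n",
                   (unsigned long long)lookup_mirror(0x1000));
            return 0;
        }
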
  • Step 340 The management node 120 feeds back a mirroring success response to the node 110A.
  • After the management node 120 allocates a mirror area with the same size as the area to be mirrored according to the memory mirroring requirement, it feeds back a mirroring success response to the node 110A.
  • the node 110A can generate a mirror identifier for the area to be mirrored, and the mirror identifier indicates that the area to be mirrored is an area that has been successfully mirrored.
  • the node 110A can also generate a mapping relationship between the virtual address (VA) of the area to be mirrored and the physical address (PA) of the area to be mirrored, so that the node 110A can determine the physical address of the area to be mirrored based on its virtual address and perform read or write operations on the area to be mirrored.
  • the storage resources used for memory mirroring can be released when business execution in the system is completed and virtual machines, containers and other high-reliability data no longer need to be backed up.
  • the method also includes step 350.
  • Step 350 The management node 120 sends a memory image release instruction to the node 110A and the node 110B.
  • the management node 120 may receive a memory image release request from the node 110A.
  • the memory image release request indicates the area to be mirrored that is requested to be released.
  • the memory image release request includes the physical address of the area to be mirrored and the size of the area to be mirrored.
  • the memory mirror release request includes the physical address segment of the area to be mirrored.
  • the memory mirror release request includes the physical address and offset address of the area to be mirrored.
  • the management node 120 determines that the area to be mirrored in node 110A and the mirror area in node 110B are not used during the monitoring period, and the management node 120 determines to release the area to be mirrored in node 110A and the mirror area in node 110B, so that the area to be mirrored and the mirror area can be used to store other data, improving storage resource utilization.
  • the first memory image release instruction sent by the management node 120 to the node 110A includes the physical address of the area to be mirrored.
  • the management node 120 sends the second memory image release instruction to the node 110B, and the second memory image release instruction includes the physical address of the mirror area.
  • Node 110A releases the area to be mirrored according to the first memory image release instruction, or modifies the image identification of the area to be mirrored to be invalid.
  • Node 110B releases the mirror area according to the second memory image release instruction, or modifies the mirror identification of the mirror area to be invalid.
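  • The release flow in step 350 amounts to the management node removing the mirroring relationship and each node either releasing its area or marking its mirror identifier invalid. A hypothetical C sketch of the node-side handling of the release instruction follows; a real implementation would also reclaim the storage space itself.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Per-area bookkeeping on a node: the mirror identifier generated in step
         * 340 is simply marked invalid when the memory image release instruction
         * carrying this area's physical address arrives (step 350).              */
        struct area_state {
            uint64_t phys_addr;
            bool     mirror_id_valid;
        };

        static void handle_release_instruction(struct area_state *area, uint64_t released_pa)
        {
            if (area->phys_addr == released_pa)
                area->mirror_id_valid = false;   /* area can now store other data */
        }

        int main(void)
        {
            struct area_state to_be_mirrored = { .phys_addr = 0x1000, .mirror_id_valid = true };
            handle_release_instruction(&to_be_mirrored, 0x1000);
            printf("mirror identifier valid: %s\n",
                   to_be_mirrored.mirror_id_valid ? "yes" : "no");
            return 0;
        }
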
  • the memory mirroring method provided by this application does not depend on the operating system of the node.
  • the management node dynamically allocates the mirror area according to memory mirroring requirements to implement memory mirroring, and there is no need to restart the host on which memory mirroring is configured; when memory mirroring is no longer needed, the storage resources used for memory mirroring are dynamically released, thereby achieving simpler and more efficient dynamic memory mirroring and improving storage resource utilization.
  • FIG 4 is a schematic flow chart of a data processing method provided by this application.
  • The following takes node 110A performing write operations and read operations on the area to be mirrored as an example. As shown in Figure 4, the method includes the following steps.
  • Step 410 The node 110A sends a write instruction to the management node 120.
  • the write instruction is used to instruct the first data to be stored in the area to be mirrored.
  • node 110A queries the address mapping table according to the virtual address of the area to be mirrored to determine the physical address of the area to be mirrored, and the write instruction includes the physical address of the area to be mirrored.
  • the address mapping table indicates the mapping relationship between virtual addresses and physical addresses.
  • Step 420 The management node 120 writes the first data into the area to be mirrored and the mirroring area.
  • After obtaining the write instruction, the management node 120 writes the first data into the area to be mirrored according to the physical address of the area to be mirrored included in the write instruction.
  • the management node 120 supports a cache coherence protocol such as CXL 3.0 and the p2p mode, and the management node 120 writes the first data into the mirror area.
  • the management node 120 queries the mirroring relationship according to the physical address of the area to be mirrored, determines the physical address of the mirroring area, and writes the first data into the mirroring area according to the physical address of the mirroring area.
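  • Steps 410 and 420 can be summarized in the following hypothetical C sketch: the node resolves the virtual address to a physical address through its address mapping table, and the management node duplicates the write into the area to be mirrored and its mirror area. The memories are modeled as plain byte arrays and the fixed VA-to-PA offset stands in for the address mapping table, purely for illustration.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define MEM_SIZE 4096u

        static uint8_t node_a_memory[MEM_SIZE];   /* holds the area to be mirrored */
        static uint8_t node_b_memory[MEM_SIZE];   /* holds the mirror area          */

        /* Step 410: node 110A resolves VA -> PA and issues the write instruction. */
        static uint64_t va_to_pa(uint64_t va) { return va + 0x100; }

        /* Step 420: the management node writes the data into both areas. */
        static void mirrored_write(uint64_t pa, uint64_t mirror_pa,
                                   const void *data, size_t len)
        {
            memcpy(&node_a_memory[pa], data, len);          /* area to be mirrored */
            memcpy(&node_b_memory[mirror_pa], data, len);   /* mirror area          */
        }

        int main(void)
        {
            const char first_data[] = "first data";
            uint64_t pa = va_to_pa(0x10);        /* from node 110A's mapping table  */
            uint64_t mirror_pa = 0x200;          /* from the mirroring relationship */

            mirrored_write(pa, mirror_pa, first_data, sizeof first_data);
            printf("mirror holds: %s\n", (char *)&node_b_memory[mirror_pa]);
            return 0;
        }
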
  • Step 430 The node 110A sends a read instruction to the management node 120.
  • the read instruction is used to instruct reading the first data from the area to be mirrored.
  • the node 110A queries the address mapping table according to the virtual address of the area to be mirrored to determine the physical address of the area to be mirrored, and the read instruction includes the physical address of the area to be mirrored.
  • When no uncorrectable error occurs in the area to be mirrored, step 440 is executed; when an uncorrectable error occurs in the area to be mirrored, step 450 is executed.
  • Step 440 The management node 120 reads the first data from the area to be mirrored. The management node 120 feeds back the first data to the node 110A.
  • Step 450 The management node 120 reads the first data from the mirror area.
  • the management node 120 determines that an uncorrectable error has occurred in the area to be mirrored, queries the mirroring relationship according to the physical address of the area to be mirrored, determines the physical address of the mirror area of the area to be mirrored, and reads the first data from the mirror area based on the physical address of the mirror area.
  • After the management node 120 reads the first data from the area to be mirrored or reads the first data from the mirror area, it feeds back the first data to the node 110A.
  • After node 110A reads data from the area to be mirrored, it verifies the data read from the area to be mirrored to determine whether an error has occurred in the read data, for example, the data read from the area to be mirrored is not the first data. If node 110A cannot correct the erroneous data using ECC technology, it instructs the management node 120 to read the first data from the mirror area, that is, step 450 is executed.
  • the management node 120 supports a cache coherence protocol such as CXL 3.0 and the p2p mode; after reading the first data from the mirror area, it writes the first data back into the area to be mirrored.
  • the management node 120 does not support a cache coherence protocol such as CXL 3.0 and the p2p mode; the management node 120 feeds back the first data read from the mirror area to the node 110A, and the node 110A requests the management node 120 to write the first data into the area to be mirrored.
  • If the first data is successfully written to the area to be mirrored, it means that there is no hardware failure in the area to be mirrored and the error may have been an accidental data error. If writing the first data to the area to be mirrored fails, it means that a hardware failure has occurred in the area to be mirrored, and a primary/backup switchover between the area to be mirrored and the mirror area is started.
  • the management node 120 can perform a primary/backup switchover between the area to be mirrored and the mirror area. For example, the management node 120 determines the mirror area as the main storage space, so that the node 110 can continue to perform read or write operations on the first data.
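  • Steps 430 to 450, including the fall-back to the mirror area on an uncorrectable error, can be sketched as follows (hypothetical C; the uce_detected flag stands in for whatever hardware mechanism reports the UCE, and the write-back models the case where the management node supports a cache coherence protocol such as CXL 3.0).

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define MEM_SIZE 4096u

        static uint8_t primary_memory[MEM_SIZE];  /* area to be mirrored             */
        static uint8_t mirror_memory[MEM_SIZE];   /* mirror area                     */
        static bool    uce_detected;              /* set when the primary area fails */

        /* Steps 440/450: read from the area to be mirrored if it is healthy,
         * otherwise read the backup copy from the mirror area and repair the
         * primary copy (write-back, as with a coherence-capable management node). */
        static void mirrored_read(uint64_t pa, uint64_t mirror_pa, void *out, size_t len)
        {
            if (!uce_detected) {
                memcpy(out, &primary_memory[pa], len);          /* step 440 */
            } else {
                memcpy(out, &mirror_memory[mirror_pa], len);    /* step 450 */
                memcpy(&primary_memory[pa], out, len);          /* repair primary */
            }
        }

        int main(void)
        {
            char buf[16] = {0};
            memcpy(&mirror_memory[0x200], "first data", 11);    /* backup copy */

            uce_detected = true;                  /* simulate a UCE in the primary */
            mirrored_read(0x110, 0x200, buf, 11);
            printf("read after UCE: %s\n", buf);
            return 0;
        }
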
  • In order to implement the functions in the above embodiments, the management node includes corresponding hardware structures and/or software modules for performing each function.
  • the units and method steps of each example described in conjunction with the embodiments disclosed in this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software driving the hardware depends on the specific application scenarios and design constraints of the technical solution.
  • FIG 5 is a schematic structural diagram of a possible management device provided by this embodiment. These management devices can be used to implement the functions of the management nodes in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
  • the management device may be the management node 120 as shown in Figure 3 or Figure 4, or it may be a module (such as a chip) applied to the server.
  • the management device 500 includes a communication module 510 , a control module 520 and a storage module 530 .
  • the management device 500 is used to implement the functions of the management node 120 in the method embodiment shown in FIG. 3 or FIG. 4 .
  • the communication module 510 is configured to receive the memory mirroring requirement of the first node, which requests that the first area in the memory used by the first node be mirrored. For example, the communication module 510 is used to perform step 320 in FIG. 3.
  • the control module 520 is configured to allocate a second area when the first node requests that the first area in the memory used by the first node be mirrored, where the second area is the mirror area of the first area, the second area is used to indicate a storage space in the second node that is the same size as the first area, and the second area is used to back up and store data in the first area. For example, the control module 520 is used to execute step 330 in FIG. 3.
  • the control module 520 is also configured to generate a mirror relationship between the first area and the second area, where the mirror relationship is used to indicate the corresponding relationship between the first physical address and the second physical address, and the second physical address is used to indicate the second area.
  • the communication module 510 is also used to receive a write operation or a read operation on the first area.
  • the communication module 510 is used to perform step 340 in FIG. 3 .
  • the communication module 510 is used to perform step 420, step 440 and step 450 in Figure 4.
  • the control module 520 is also used to perform write operations or read operations on the first area and the second area according to the mirroring relationship.
  • the communication module 510 is also used to feedback the success of mirroring to the node.
  • the communication module 510 is used to perform step 340 in FIG. 3 .
  • the communication module 510 is also used to send a memory image release request to the node.
  • the communication module 510 is used to perform step 350 in FIG. 3 .
  • the storage module 530 is used to store the mirror relationship so that the control module 520 can access the mirror area according to the mirror relationship.
  • the management device 500 in the embodiment of the present application can be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • When implemented by software, the management device 500 and each of its modules may also be software modules.
  • the management device 500 may correspond to the device performing the method described in the embodiments of the present application, and the above and other operations and/or functions of each unit in the management device 500 are respectively intended to implement the corresponding processes of each method in Figure 3 or Figure 4; for the sake of brevity, they are not repeated here.
  • FIG. 6 is a schematic structural diagram of a computing device 600 provided in this embodiment.
  • computing device 600 includes a processor 610, a bus 620, a memory 630, a communication interface 640, and a memory unit 650 (which may also be referred to as a main memory unit).
  • the processor 610, the memory 630, the memory unit 650 and the communication interface 640 are connected through a bus 620.
  • the processor 610 can be a CPU, and the processor 610 can also be other general-purpose processors, digital signal processors (digital signal processing, DSP), ASICs, FPGAs or other programmable logic devices, Discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the processor can also be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the program of this application.
  • the communication interface 640 is used to implement communication between the computing device 600 and external devices or components. In this embodiment, when the computing device 600 is used to implement the functions of the management node 120 shown in Figure 1, the communication interface 640 is used to obtain the memory mirroring requirement, and the processor 610 allocates the mirror area. When the computing device 600 is used to implement the functions of the node 110 shown in Figure 1, the communication interface 640 is used to send the memory mirroring requirement.
  • Bus 620 may include a path for communicating information between the components described above, such as processor 610, memory unit 650, and storage 630.
  • the bus 620 may also include a power bus, a control bus, a status signal bus, etc.
  • the various buses are labeled bus 620 in the figure.
  • the bus 620 may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc.
  • the bus 620 can be divided into an address bus, a data bus, a control bus, etc.
  • computing device 600 may include multiple processors.
  • the processor may be a multi-CPU processor.
  • a processor here may refer to one or more devices, circuits, and/or computing units for processing data (eg, computer program instructions).
  • the processor 610 is also used to allocate a second area when the first node requests that the first area in the memory used by the first node be mirrored.
  • the second area is a mirror area of the first area.
  • the second area is used to indicate the storage space in the second node that is the same size as the first area.
  • the second area is used to back up and store data in the first area.
  • the processor 610 is also used to request a write operation or a read operation on the area for which the mirror has been applied.
  • the processor 610 is also used to perform write operations or read operations on the mirror area according to the mirror relationship.
  • FIG. 6 only takes the computing device 600 including a processor 610 and a memory 630 as an example.
  • the processor 610 and the memory 630 are respectively used to indicate a type of device or device.
  • the quantity of each type of device or equipment can be determined based on business needs.
  • the memory unit 650 may be used to store the mirroring relationship in the above method embodiment.
  • Memory unit 650 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which is used as an external cache.
  • Many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM).
  • the memory 630 may correspond to the storage medium used to store computer instructions, memory operation instructions, node identifiers and other information in the above method embodiments, for example, a magnetic disk, such as a mechanical hard disk or a solid state hard disk.
  • computing device 600 may be a general-purpose device or a special-purpose device.
  • computing device 600 may be an edge device (eg, a box carrying a chip with processing capabilities), or the like.
  • the computing device 600 may also be a server or other device with computing capabilities.
  • the computing device 600 may correspond to the management device 500 in this embodiment and may correspond to the corresponding subject executing any method according to Figure 3 or Figure 4; the above and other operations and/or functions of each module in the management device 500 are respectively intended to implement the corresponding processes of each method in Figure 3 or Figure 4, and for the sake of simplicity, they are not described again here.
  • the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules.
  • Software modules can be stored in random access memory (random access memory, RAM), flash memory, read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM) , PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), register, hard disk, mobile hard disk, CD-ROM or other well-known in the art any other form of storage media.
  • An exemplary storage medium is coupled to the processor such that the processor The processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC. Additionally, the ASIC can be located in a computing device. Of course, the processor and storage medium may also exist as discrete components in a computing device.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user equipment, or other programmable device.
  • the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
  • the computer program or instructions may be transmitted from a website, computer, A server or data center transmits via wired or wireless means to another website site, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media, such as floppy disks, hard disks, and magnetic tapes; they may also be optical media, such as digital video discs (DVDs); they may also be semiconductor media, such as solid state drives (solid state drives). ,SSD).
  • SSD solid state drives

Abstract

公开了数据处理系统、内存镜像方法、装置和计算设备,涉及计算机领域。系统包括多个节点和管理节点。第一节点请求对第一节点所使用的内存中第一区域进行镜像;管理节点分配第二区域,第二区域用于指示第二节点中与第一区域的大小相同的存储空间,第二区域用于备份存储第一区域的数据。在节点没有提出内存镜像需求时,系统中的存储资源用于存储不同的数据,仅在提出内存镜像需求时,才从系统的存储资源中分配镜像区域,使镜像区域备份存储待镜像区域存储的数据,提升数据高可靠性。另外,待镜像区域和镜像区域可以是不同节点内的存储空间,从而,灵活动态地分配镜像区域实现内存镜像,提升内存镜像配置的灵活性以及存储资源的利用率。

Description

数据处理系统、内存镜像方法、装置和计算设备
本申请要求于2022年09月09日提交国家知识产权局、申请号为202211105202.3,申请名称为“一种内存镜像的实现方法”的中国专利申请的优先权,本申请还要求于2022年11月30日提交国家知识产权局、申请号为202211519995.3,申请名称为“数据处理系统、内存镜像方法、装置和计算设备”的中国专利申请的优先权,这些全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,尤其涉及一种数据处理系统、内存镜像方法、装置和计算设备。
背景技术
内存镜像(mirror)是解决内存的不可纠正错误(Uncorrectable Error,UCE)的有效手段,即将内存中一部分存储空间作为另一部分存储空间的镜像区域存储备份数据。通常,采用静态配置方式或由操作系统分配内存中相邻页作为镜像区域实现内存镜像。如果镜像区域太大,导致浪费内存的存储资源。如果镜像区域太小,导致无法解决内存的UCE。因此,目前内存镜像配置不灵活,导致存储资源的利用率较低。
发明内容
本申请提供了数据处理系统、内存镜像方法、装置和计算设备,由此实现灵活地配置内存镜像,提升内存的存储资源的利用率。
第一方面,提供了一种数据处理系统,数据处理系统包括多个节点和管理节点。其中,第一节点,用于请求对第一节点所使用的内存中第一区域进行镜像;管理节点,用于分配第二区域,即第一区域为待镜像区域,第二区域为第一区域的镜像区域,第二区域用于指示第二节点中与第一区域的大小相同的存储空间,第二区域用于备份存储第一区域的数据。
相对于采用静态配置方式在系统启动前预先配置镜像区域,浪费存储资源,本申请提供的方案在节点没有提出内存镜像需求时,系统中的存储资源用于存储不同的数据,仅在提出内存镜像需求时,才从系统的存储资源中分配镜像区域,使镜像区域备份存储待镜像区域存储的数据,提升数据高可靠性。另外,相对于由操作系统分配内存中相邻页作为镜像区域实现内存镜像,本申请提供的方案不限定待镜像区域和镜像区域的位置关系,待镜像区域和镜像区域可以是不同节点内的存储空间,从而,灵活动态地分配镜像区域实现内存镜像,提升内存镜像配置的灵活性以及存储资源的利用率。
结合第一方面,在一种可能的实现方式中,第一节点指示了第一区域的第一物理地址;管理节点,还用于生成第一区域和第二区域的镜像关系,镜像关系用于指示第一物理地址与第二物理地址的对应关系,第二物理地址用于指示第二区域。从而,在第一节点对第一区域进行读操作或写操作时,以便于管理节点根据镜像关系确定第一区域的镜像区域,对第一区域的镜像区域进行写操作,或当第一区域发生不可纠正错误时,从第二区域读取第一数据,避免出现数据处理失败的现象。
在一种示例中,管理节点,还用于接收第一节点发送的写指示,将第一数据写入第一区域和第二区域。写指示用于指示将第一数据存储到第一区域。
在另一种示例中,管理节点,还用于接收第一节点的读指示,读指示用于指示从第一区域读取第一数据;管理节点,还用于当第一区域未发生不可纠正错误时,从第一区域读取第一数据。或者,管理节点,还用于当第一区域发生不可纠正错误时,从第二区域读取第一数据,从而使第一节点成功读取第一数据,避免所需第一数据的业务受到影响。
结合第一方面,在另一种可能的实现方式中,第一区域为主存储空间,第二区域为备存储空间;管理节点,还用于当第一区域发生不可纠正错误时,将第二区域确定为主存储空间。
结合第一方面,在另一种可能的实现方式中,管理节点,还用于指示第一节点将第一区域的镜像标识修改为无效。从而便于节点释放第一区域的存储资源,提升存储资源的利用率。
结合第一方面,在另一种可能的实现方式中,第一区域的大小是由应用需求确定的。
结合第一方面,在另一种可能的实现方式中,第二区域包括第二节点的本地存储空间、第二节点的扩展存储空间和全局内存池中第二节点的存储空间中任一种。
结合第一方面,在另一种可能的实现方式中,管理节点支持缓存一致性协议。
第二方面,提供一种内存镜像方法,数据处理系统包括多个节点和管理节点;方法包括:第一节点请求对第一节点所使用的内存中第一区域进行镜像;管理节点分配第二区域,第二区域为第一区域的镜像区域,第二区域用于指示第二节点中与第一区域的大小相同的存储空间,第二区域用于备份存储第一区域的数据。
结合第二方面,在一种可能的实现方式中,第一节点指示了第一区域的第一物理地址;方法还包括:管理节点生成第一区域和第二区域的镜像关系,镜像关系用于指示第一物理地址与第二物理地址的对应关系,第二物理地址用于指示第二区域。
结合第二方面,在另一种可能的实现方式中,方法还包括:管理节点接收第一节点发送的写指示,写指示用于指示将第一数据存储到第一区域;管理节点将第一数据写入第一区域和第二区域。
结合第二方面,在另一种可能的实现方式中,方法还包括:管理节点接收第一节点的读指示,读指示用于指示从第一区域读取第一数据;管理节点当第一区域未发生不可纠正错误时,从第一区域读取第一数据。
结合第二方面,在另一种可能的实现方式中,方法还包括:管理节点当第一区域发生不可纠正错误时,从第二区域读取第一数据。
结合第二方面,在另一种可能的实现方式中,第一区域为主存储空间,第二区域为备存储空间,方法还包括:管理节点当第一区域发生不可纠正错误时,将第二区域确定为主存储空间。
结合第二方面,在另一种可能的实现方式中,方法还包括:管理节点指示第一节点将第一区域的镜像标识修改为无效。
结合第二方面,在另一种可能的实现方式中,第一区域的大小是由应用需求确定的。
结合第二方面,在另一种可能的实现方式中,第二区域包括第二节点的本地存储空间、第二节点的扩展存储空间和全局内存池中第二节点的存储空间中任一种。
结合第二方面,在另一种可能的实现方式中,管理节点支持缓存一致性协议。
第三方面,提供了一种管理装置,所述装置包括用于执行第二方面或第二方面任一种可能设计中的管理节点执行的方法的各个模块。
第四方面,提供了一种数据处理节点,所述节点包括用于执行第二方面或第二方面任一种可能设计中的节点执行的方法的各个模块。
第五方面,提供一种计算设备,该计算设备包括至少一个处理器和存储器,存储器用于存储一组计算机指令;当处理器作为第二方面或第二方面任一种可能实现方式中的管理节点执行所述一组计算机指令时,执行第二方面或第二方面任一种可能实现方式中的内存镜像方法的操作步骤。
第六方面,提供一种芯片,包括:处理器和供电电路;其中,所述供电电路用于为所述处理器供电;所述处理器用于执行第二方面或第二方面任一种可能实现方式中的内存镜像方法的操作步骤。
第七方面,提供一种计算机可读存储介质,包括:计算机软件指令;当计算机软件指令在计算设备中运行时,使得计算设备执行如第二方面或第二方面任意一种可能的实现方式中所述方法的操作步骤。
第八方面,提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算设备执行如第二方面或第二方面任意一种可能的实现方式中所述方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为本申请提供的一种数据处理系统的架构示意图;
图2为本申请提供的一种全局内存池的部署场景示意图;
图3为本申请提供的一种内存镜像方法的流程示意图;
图4为本申请提供的一种数据处理方法的流程示意图;
图5为本申请提供的一种管理装置的结构示意图;
图6为本申请提供的一种计算设备的结构示意图。
具体实施方式
为了便于描述,首先对本申请涉及的术语进行简单介绍。
内存(memory)也称内存储器和主存储器(main memory)。内存是计算机系统的重要部件,即外部存储器(或称为辅助存储器)与中央处理器(central processing unit,CPU)进行沟通的桥梁。内存用于暂时存放CPU中的运算数据以及CPU与硬盘等外部存储器交换的数据。例如,计算机开始运行,将需要运算的数据从内存加载到CPU中进行运算,运算完成后,CPU将运算结果存入内存。
可纠正错误(Correctable Error,CE),是指可以采用纠错码(Error Correction Code,ECC)技术纠正的内存错误,确保主机的可用可靠可维护(Reliability,Availability and Serviceability,RAS)。
不可纠正错误(Uncorrectable Error,UCE),是指当内存错误超过ECC的纠错能力,无法采用ECC技术纠正内存错误。如果内存中发生不可纠正错误的存储空间已配置镜像区域,可以从镜像区域获取该存储空间的备份数据。
全局镜像(Global Mirror),是指将内存中一半的存储空间作为另一半存储空间的镜像区域,用于备份存储另一半存储空间存储的数据。
局部镜像也称基于地址区间的内存地址镜像,是指将内存中一个地址段指示的存储空间中一半的区域作为另一半区域的镜像区域。
缓存线(cacheline),指计算机设备对内存的存储空间进行读操作或写操作的单位。一个缓存线的大小可以为64字节(byte,B)。
交织,指将访问内存的数据按照单位存储空间(例如,缓存线)均匀地分布到多个内存通道上。交织方式可以由系统管理员配置,可以在一个处理器连接的多个内存通道之间进行交织,也可以在多个处理器的多个内存通道之间进行交织。
内存通道,指计算机设备中处理器连接的多个内存。处理器可以采用交织技术对内存进行操作。例如,处理器根据缓存线的大小将待写入内存的数据均匀地分布到多个内存通道上。进而,处理器根据缓存线的大小从多个内存通道上读取数据。从而,基于多个内存通道进行数据处理,以提升计算机设备的内存带宽利用率和处理性能。
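The interleaving described above can be illustrated with a short C sketch that maps a physical address to a memory channel at cache-line granularity. This is a minimal sketch for illustration only: the 64-byte line size and the four-channel configuration are assumptions, not values fixed by this application.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHELINE_SIZE 64u   /* assumed cache-line granularity, in bytes */
#define NUM_CHANNELS   4u    /* assumed number of interleaved memory channels */

/* Return the channel that holds the cache line containing phys_addr. */
static unsigned channel_of(uint64_t phys_addr) {
    return (unsigned)((phys_addr / CACHELINE_SIZE) % NUM_CHANNELS);
}

int main(void) {
    /* Consecutive cache lines are spread round-robin across the channels. */
    for (uint64_t addr = 0; addr < 8 * CACHELINE_SIZE; addr += CACHELINE_SIZE)
        printf("addr 0x%04llx -> channel %u\n",
               (unsigned long long)addr, channel_of(addr));
    return 0;
}
```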
超节点(Super Node),指通过高带宽、低时延的片间互连总线和交换机将多个节点互连成一个高性能集群。超节点的规模大于缓存一致非统一内存寻址(Cache-Coherent Non Uniform Memory Access,CC-NUMA)架构下的节点规模,超节点内节点的互连带宽大于以太网络互连带宽。
高性能计算(High Performance Computing,HPC)集群,指一个计算机集群系统。HPC集群包含利用各种互联技术连接在一起的多个计算机。互联技术例如可以是无限带宽技术(infiniband,IB)、基于聚合以太网的远程直接内存访问(Remote Direct Memory Access over Converged Ethernet,RoCE)或传输控制协议(Transmission Control Protocol,TCP)。HPC提供了超高浮点计算能力,可用于解决计算密集型和海量数据处理等业务的计算需求。连接在一起的多个计算机的综合计算能力可以用来处理大型计算问题。例如,科学研究、气象预报、金融、仿真实验、生物制药、基因测序和图像处理等行业均涉及利用HPC集群来解决的大型计算问题和计算需求。利用HPC集群处理大型计算问题可以有效地缩短处理数据的计算时间,以及提高计算精度。
内存操作指令,可以称为内存语义或内存操作函数。内存操作指令包括内存分配(malloc)、内存设置(memset)、内存复制(memcpy)、内存移动(memmove)、内存释放(memory release)和内存比较(memcmp)中至少一种。
内存分配用于支持应用程序运行分配一段内存。
内存设置用于设置全局内存池的数据模式,例如初始化。
内存复制用于将源地址(source)指示的存储空间存储的数据复制到目的地址(destination)指示的存储空间。
内存移动用于将源地址(source)指示的存储空间存储的数据复制到目的地址(destination)指示的存储空间,并删除源地址(source)指示的存储空间存储的数据。
内存比较用于比较两个存储空间存储的数据是否相等。
内存释放用于释放内存中存储的数据,以提高系统内存资源的利用率,进而提升系统性能。
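The memory-operation semantics listed above map closely onto the standard C library. The sketch below only illustrates those semantics (allocate, set, copy, move, compare, release) with ordinary libc calls; it is not the memory interface defined by this application, and note that the C `memmove` does not clear the source region, unlike the "move" semantic described above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *src = malloc(16);                 /* memory allocation */
    char *dst = malloc(16);
    if (!src || !dst) return 1;

    memset(src, 0, 16);                     /* memory set: initialize the region */
    memset(dst, 0, 16);
    strcpy(src, "mirror");

    memcpy(dst, src, 16);                   /* memory copy: src -> dst */
    printf("equal after copy: %d\n", memcmp(src, dst, 16) == 0);  /* memory compare */

    memmove(src + 2, src, 7);               /* memory move: handles overlapping regions */

    free(src);                              /* memory release */
    free(dst);
    return 0;
}
```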
为了解决内存镜像的配置不灵活,导致存储资源的利用率较低的问题,本申请提供一种数据处理系统,数据处理系统包括多个节点和管理节点。当第一节点请求对第一节点所使用的内存中第一区域进行镜像时,管理节点分配第二区域,即第一区域为待镜像区域,第二区域为第一区域的镜像区域,第二区域用于指示第二节点中与第一区域的大小相同的存储空间,第二区域用于备份存储第一区域的数据。相对于采用静态配置方式在系统启动前预先配置镜像区域,浪费存储资源,本申请提供的内存镜像方法在没有提出内存镜像需求时,系统中的存储资源用于存储不同的数据,仅在提出内存镜像需求时,才从系统的存储资源中分配镜像区域,使镜像区域备份存储待镜像区域存储的数据,提升数据高可靠性。另外,相对于由操作系统分配内存中相邻页作为镜像区域实现内存镜像,本申请的方法不限定待镜像区域和镜像区域的位置关系,待镜像区域和镜像区域可以是不同节点内的存储空间,从而,灵活动态地分配镜像区域实现内存镜像,提升内存镜像配置的灵活性以及存储资源的利用率。
图1为本申请提供的一种数据处理系统的架构示意图。如图1所示,数据处理系统100是一种提供高性能计算的实体。数据处理系统100包括多个节点110。节点110可以包括计算节点和存储节点。
例如,节点110可以是处理器、服务器、台式计算机、智能网卡、内存扩展卡、存储阵列的控制器和存储器等。处理器可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、数据处理单元(data processing unit,DPU)、神经处理单元(neural processing unit,NPU)和嵌入式神经网络处理器(neural-network processing unit,NPU)等用于数据处理的XPU。
当节点110是计算能力(Computing Power)较高的GPU、DPU、NPU等数据处理的XPU时,节点110可以作为加速器,将通用处理器(如:CPU)的作业卸载到加速器,由加速器处理计算需求较高的作业(如:HPC、大数据作业、数据库作业等),解决由于通用处理器浮点算力不足,无法满足HPC、人工智能(Artificial Intelligence,AI)等场景的重浮点计算需求的问题,从而,缩短数据处理时长以及降低系统能耗,提升系统性能。节点的计算能力也可以称为节点的计算算力。在一些实施例中,节点110内部也可以集成加速器。独立部署的加速器和集成加速器的节点支持灵活插拔,可以按需弹性扩展数据处理系统的规模,从而满足不同的应用场景下的计算需求。
存储节点包括一个或多个控制器、网卡与多个硬盘。硬盘用于存储数据。硬盘可以是磁盘或者其他类型的存储介质,例如固态硬盘或者叠瓦式磁记录硬盘等。网卡用于与计算集群包含的计算节点通信。控制器用于根据计算节点发送的读/写数据请求,往硬盘中写入数据或者从硬盘中读取数据。在读写数据的过程中,控制器需要将读/写数据请求中携带的地址转换为硬盘能够识别的地址。
多个节点110基于具有高带宽、低时延的高速互连链路连接。在一些实施例中,如图1所示,管理节点120(如:交换机)基于高速互连链路连接多个节点110。例如,管理节点120通过光纤、铜缆或铜线连接多个节点110。管理节点可称为交换芯片或互联芯片或基板管理控制器(Baseboard Management Controller,BMC)。
管理节点120基于高速互连链路连接的多个节点110组成的数据处理系统100也可以称为超节点。多个超节点通过数据中心网络进行连接。数据中心网络包括多个核心交换机和多个汇聚交换机。数据中心网络可以组成一个规模域。多个超节点可以组成一个性能域。两个以上超节点可以组成宏机柜。宏机柜之间也可以基于数据中心网络连接。
管理节点120用于根据节点110发出的内存镜像需求,为节点110所使用的内存中待镜像区域分配与待镜像区域的大小相同的镜像区域。其中,管理节点120可以支持计算快速链接(Compute Express Link,CXL)等缓存一致性协议,保持内存镜像的高性能、低时延和数据一致性。
在另一些实施例中,多个节点110基于具有高带宽、低时延的高速互连链路进行直接连接。节点110具备本申请提供的管理节点120的功能。
数据处理系统100支持运行大数据、数据库、高性能计算、人工智能、分布式存储和云原生等应用。本申请实施例中需要备份存储的数据包括虚拟机(Virtual Machine,VM)、容器、高可用的(High Available,HA)应用程序、大数据、数据库、高性能计算、人工智能(Artificial Intelligence,AI)、分布式存储和云原生等应用的业务数据。
其中,待镜像区域和镜像区域可以是不同节点内的存储空间。镜像区域可以由系统中任一个节点110的本地存储介质、扩展存储介质或全局内存池提供。
在一些实施例中,数据处理系统100中节点110的存储介质经过统一编址构成全局内存池,实现跨超节点内节点(简称:跨节点)的内存语义访问。全局内存池为由节点的存储介质经过统一编址构成的节点共享的资源。
本申请提供的全局内存池可以包括超节点中计算节点的存储介质和存储节点的存储介质。计算节点的存储介质包括计算节点内的本地存储介质和计算节点连接的扩展存储介质中至少一种。存储节点的存储介质包括存储节点内的本地存储介质和存储节点连接的扩展存储介质中至少一种。
例如,全局内存池包括计算节点内的本地存储介质和存储节点内的本地存储介质。
又如,全局内存池包括计算节点内的本地存储介质、计算节点连接的扩展存储介质,以及存储节点内的本地存储介质和存储节点连接的扩展存储介质中任意一种。
又如,全局内存池包括计算节点内的本地存储介质、计算节点连接的扩展存储介质、存储节点内的本地存储介质和存储节点连接的扩展存储介质。
示例地,如图2所示,为本申请提供的一种全局内存池的部署场景示意图。全局内存池200包括N个计算节点中每个计算节点内的存储介质210、N个计算节点中每个计算节点连接的扩展存储介质220、M个存储节点中每个存储节点内的存储介质230和M个存储节点中每个存储节点连接的扩展存储介质240。
应理解,全局内存池的存储容量可以包括计算节点的存储介质中的部分存储容量和存储节点的存储介质中的部分存储容量。全局内存池是经过统一编址的超节点内计算节点和存储节点均可以访问的存储介质。全局内存池的存储容量可以通过大内存、分布式数据结构、数据缓存、元数据等内存接口供计算节点或存储节点使用。计算节点运行应用程序可以使用这些内存接口对全局内存池进行内存操作。如此,基于计算节点的存储介质的存储容量和存储节点的存储介质构建的全局内存池北向提供了统一的内存接口供计算节点使用,使计算节点使用统一的内存接口将数据写入全局内存池的计算节点提供的存储空间或存储节点提供的存储空间,实现基于内存操作指令的数据的计算和存储,以及降低数据处理的时延,提升数据处理的速度。
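A minimal C sketch of the unified-addressing idea behind the global memory pool: an address in the pool is resolved to the contributing node and a local offset. The segment table, the capacities and the function name are hypothetical and serve only to show how a single address space can span storage media contributed by several nodes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical contribution of each node to the global memory pool. */
typedef struct {
    int      node_id;
    uint64_t base;     /* start of this node's range in the unified address space */
    uint64_t size;     /* bytes contributed by this node */
} pool_segment_t;

static const pool_segment_t pool[] = {
    { 0, 0x0000000000ULL, 1ULL << 30 },   /* compute node local DRAM */
    { 1, 0x0040000000ULL, 1ULL << 30 },   /* compute node extended memory */
    { 2, 0x0080000000ULL, 2ULL << 30 },   /* storage node DRAM */
};

/* Resolve a unified pool address to (node, local offset); returns 0 on success. */
static int pool_resolve(uint64_t pool_addr, int *node_id, uint64_t *local_off) {
    for (size_t i = 0; i < sizeof(pool) / sizeof(pool[0]); i++) {
        if (pool_addr >= pool[i].base && pool_addr < pool[i].base + pool[i].size) {
            *node_id   = pool[i].node_id;
            *local_off = pool_addr - pool[i].base;
            return 0;
        }
    }
    return -1;   /* address not backed by any contribution */
}

int main(void) {
    int node; uint64_t off;
    if (pool_resolve(0x0050000000ULL, &node, &off) == 0)
        printf("node %d, local offset 0x%llx\n", node, (unsigned long long)off);
    return 0;
}
```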
上述是以计算节点内的存储介质和存储节点内的存储介质构建全局内存池为例进行说明。全局内存池的部署方式可以灵活多变,本申请实施例不予限定。例如,全局内存池由存储节点的存储介质构建。又如,全局内存池由计算节点的存储介质构建。使用单独的存储节点的存储介质或计算节点的存储介质构建全局内存池可以减少存储侧的存储资源的占用,以及提供更灵活的扩展方案。
依据存储介质的类型划分,本申请实施例提供的全局内存池的存储介质包括动态随机存取存储器(Dynamic Random Access Memory,DRAM)、固态驱动器(Solid State Disk或Solid State Drive,SSD)和存储级内存(storage-class-memory,SCM)。
在一些实施例中,可以根据存储介质的类型设置全局内存池,即利用一种类型的存储介质构建一种内存池,不同类型的存储介质构建不同类型的全局内存池,使全局内存池应用于不同的场景,计算节点根据应用的访问特征选择存储介质,增强了用户对系统控制权限,提升了用户的系统体验又扩展了系统适用的应用场景。例如,将计算节点中的DRAM和存储节点中的DRAM进行统一编址构成DRAM内存池。DRAM内存池用于对访问性能要求高,数据容量适中,无数据持久化诉求的应用场景。又如,将计算节点中的SCM和存储节点中的SCM进行统一编址构成SCM内存池。SCM内存池则用于对访问性能不敏感,数据容量大,对数据持久化有诉求的应用场景。
接下来,结合图3至图4对本申请提供的内存镜像方法的实施方式进行详细描述。
图3为本申请提供的一种内存镜像方法的流程示意图。在这里以节点110A请求内存镜像为例进行说明。如图3所示,该方法包括以下步骤。
步骤310、节点110A向管理节点120发送内存镜像需求。
为了提高数据可靠性,节点110A可以向管理节点120发送内存镜像需求,请求对存储数据的第一区域进行内存镜像,即管理节点120分配与第一区域的大小相同的第二区域,即第一区域为待镜像区域,第二区域为第一区域的镜像区域,第二区域用于指示第二节点中与第一区域的大小相同的存储空间,由镜像区域备份存储待镜像区域存储的数据。
需要进行备份的数据可以包括虚拟机(Virtual Machine,VM)、容器、高可用的(High Available,HA)应用程序和业务需求。业务需求可以指示业务执行过程中重要数据进行备份存储的需求。也就是,需要进行备份的数据存储到待镜像区域和镜像区域。如果待镜像区域发生故障或者待镜像区域存储的数据发生错误,可以从镜像区域获取数据,从而提高数据的可靠性,避免由于存储数据的存储空间故障或数据错误,导致业务出现问题,影响用户体验。
在一些实施例中,节点110A启动后,可以根据镜像策略向管理节点120发送内存镜像需求。镜像策略指示依据应用的可靠性等级确定内存镜像需求。可靠性指示产品在使用期间没有发生故障的性质。对产品而言,产品的可靠性越高,产品可以无故障工作的时间就越长。例如,系统管理员可以预先配置应用的可靠性等级,节点110A根据应用的可靠性等级发送内存镜像需求,对于具有高可靠性要求的应用,向管理节点120申请内存镜像,对于低可靠性要求的应用,无需向管理节点120申请内存镜像。
步骤320、管理节点120获取内存镜像需求。
管理节点120可以通过连接节点110A的光纤接收节点110A发送的内存镜像需求。内存镜像需求用于指示节点110A所使用的内存中待镜像区域。
节点110A所使用的内存包括本地存储介质、扩展存储介质和全局内存池中至少一种。可理解地,节点110A请求进行内存镜像的待镜像区域可以是节点110A的本地存储介质、扩展存储介质和全局内存池中任一种存储介质中的存储空间。
其中,内存镜像需求具体指示了待镜像区域的物理地址和待镜像区域的大小,以便于管理节点120直接从内存镜像需求中获取待镜像区域的大小。
在一种示例中,内存镜像需求包括待镜像区域的物理地址段。管理节点120根据物理地址段确定待镜像区域的大小。
在另一种示例中,内存镜像需求包括待镜像区域的物理地址和偏移地址。管理节点120根据待镜像区域的物理地址和偏移地址确定待镜像区域的大小。
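A small C sketch of the two request encodings described above; the struct layouts and field names are assumptions made for illustration. Either encoding lets the management node derive the size of the region to be mirrored.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {              /* encoding 1: physical address segment */
    uint64_t phys_start;
    uint64_t phys_end;        /* exclusive end of the segment */
} mirror_req_segment_t;

typedef struct {              /* encoding 2: physical address plus offset */
    uint64_t phys_addr;
    uint64_t offset;          /* length of the region to be mirrored, in bytes */
} mirror_req_offset_t;

int main(void) {
    mirror_req_segment_t a = { 0x1000, 0x3000 };
    mirror_req_offset_t  b = { 0x1000, 0x2000 };
    /* In both encodings the management node can derive the region size. */
    printf("segment size: %llu bytes\n", (unsigned long long)(a.phys_end - a.phys_start));
    printf("offset  size: %llu bytes\n", (unsigned long long)b.offset);
    return 0;
}
```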
步骤330、管理节点120根据内存镜像需求分配镜像区域。
管理节点120从其所管理的存储介质中确定一个空闲存储介质,从空闲存储介质中划分一个与待镜像区域的大小相同的区域作为镜像区域。管理节点120所管理的存储介质包括系统中任一节点的本地存储介质、扩展存储介质和构成全局内存池的存储介质。
另外,镜像区域所属的存储介质可以是系统中任一个存储介质,对镜像区域所属的存储介质与待镜像区域所属的存储介质的关系不予限定。空闲存储介质可以是与待镜像区域所属的存储介质距离较远的存储介质。例如,镜像区域所属的存储介质和待镜像区域所属的存储介质可以位于不同的机房或不同的机柜。从而,将镜像区域和待镜像区域拉远,即从与待镜像区域所属的存储介质不同的存储介质分配镜像区域,避免由于镜像区域和待镜像区域部署在同一个存储介质,导致镜像区域和待镜像区域同时失效,从而,降低镜像区域和待镜像区域同时失效的可能性,提高内存镜像的可靠性。
假设管理节点120从节点110B中划分一个与待镜像区域的大小相同的区域作为镜像区域。节点110A和节点110B可以是两个独立的物理设备,节点110A和节点110B之间的距离较远,节点110A和节点110B可以位于不同的机房或不同的机柜。
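A hedged C sketch of the allocation step: choose a free medium other than the one holding the region to be mirrored, preferring one that is physically distant (for example, in a different rack), and carve out a region of the same size as the mirror region. The medium descriptor, the rack field used as a distance hint and the function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int      medium_id;
    int      node_id;
    int      rack_id;        /* coarse "distance" hint */
    uint64_t free_bytes;
    uint64_t next_free;      /* next free physical address in this medium */
} medium_t;

/* Allocate a mirror region of `size` bytes outside `src_medium`, preferring a
 * medium in a different rack. Returns the chosen medium or NULL. */
static medium_t *alloc_mirror(medium_t *media, size_t n, int src_medium,
                              int src_rack, uint64_t size, uint64_t *mirror_pa) {
    medium_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        medium_t *m = &media[i];
        if (m->medium_id == src_medium || m->free_bytes < size)
            continue;
        if (!best || (m->rack_id != src_rack && best->rack_id == src_rack))
            best = m;        /* prefer a medium in a different rack */
    }
    if (!best) return NULL;
    *mirror_pa        = best->next_free;
    best->next_free  += size;
    best->free_bytes -= size;
    return best;
}

int main(void) {
    medium_t media[] = {
        { 0, 0, 0, 1u << 20, 0x1000 },   /* medium holding the region to be mirrored */
        { 1, 1, 0, 1u << 20, 0x2000 },   /* same rack */
        { 2, 2, 1, 1u << 20, 0x3000 },   /* different rack: preferred */
    };
    uint64_t mirror_pa;
    medium_t *m = alloc_mirror(media, 3, /*src_medium=*/0, /*src_rack=*/0,
                               64 * 1024, &mirror_pa);
    return m ? 0 : 1;
}
```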
可选地,管理节点120也可以根据可靠性等级确定分配镜像区域的数量,即管理节点120根据从高到低的可靠性等级分配数量不同的镜像区域,以对高可靠性的数据实现多份备份的效果,确保数据的可靠性。例如,可靠性等级包括从低到高的可靠性等级1至可靠性等级5。当内存镜像需求指示了可靠性等级1时,管理节点120根据内存镜像需求指示的可靠性等级1分配一个镜像区域。当内存镜像需求指示了可靠性等级2时,管理节点120根据内存镜像需求指示的可靠性等级2分配两个镜像区域。
本申请对镜像区域所属的存储介质和待镜像区域所属的存储介质的类型不予限定,例如存储介质包括DRAM、SSD和SCM中任一种。
另外,本申请对待镜像区域的大小不予限定,即不限定内存镜像粒度。管理节点120可以对任意大小的存储区域进行内存镜像,从而,根据内存镜像需求进行内存镜像,提高存储资源的利用率。避免采用静态配置镜像区域时,镜像区域太大,导致浪费内存的存储资源;镜像区域太小,导致无法解决内存的UCE。例如,若内存镜像粒度大于内存交织粒度,镜像区域故障会导致多个以交织方式访问内存的数据受影响,降低存储资源的利用率。可选地,内存镜像粒度可以为64字节(Bytes),配合内存交织的粒度,从而避免交织的存储区域隔离扩大导致额外的内存浪费。
在另一些实施例中,管理节点120可以构建待镜像区域和镜像区域的镜像关系,以便于管理节点120根据镜像关系确定镜像区域,对镜像区域进行读操作或写操作。
在一种示例中,待镜像区域和镜像区域的镜像关系指示待镜像区域的物理地址与镜像区域的物理地址的对应关系。镜像关系可以以表格的形式呈现,如表1所示。
表1

待镜像区域的物理地址        镜像区域的物理地址
物理地址1                   物理地址2
如表1所示,待镜像区域的物理地址1对应镜像区域的物理地址2,管理节点120根据待镜像区域的物理地址1查表,确定镜像区域的物理地址为物理地址2,根据镜像区域的物理地址2对镜像区域进行读操作或写操作。
需要说明的是,表1只是以表格的形式示意对应关系在存储设备中的存储形式,并不是对该对应关系在存储设备中的存储形式的限定,当然,该对应关系在存储设备中的存储形式还可以以其他的形式存储,本实施例对此不做限定。
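A minimal C sketch of the mirror relationship lookup suggested by Table 1: the management node keeps a mapping from the physical address of the region to be mirrored to the physical address of its mirror region, and consults it before operating on the mirror region. The flat table used here is only one possible representation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t src_pa;      /* physical address of the region to be mirrored */
    uint64_t mirror_pa;   /* physical address of its mirror region */
    uint64_t size;        /* size of both regions, in bytes */
} mirror_entry_t;

static mirror_entry_t mirror_table[64];
static size_t mirror_count;

/* Translate an address in a mirrored region to the matching address in its
 * mirror region; returns 0 and fills *out on success, -1 if not mirrored. */
static int mirror_lookup(uint64_t pa, uint64_t *out) {
    for (size_t i = 0; i < mirror_count; i++) {
        mirror_entry_t *e = &mirror_table[i];
        if (pa >= e->src_pa && pa < e->src_pa + e->size) {
            *out = e->mirror_pa + (pa - e->src_pa);
            return 0;
        }
    }
    return -1;
}

int main(void) {
    mirror_table[mirror_count++] = (mirror_entry_t){ 0x1000, 0x9000, 0x2000 };
    uint64_t mpa;
    if (mirror_lookup(0x1abc, &mpa) == 0)
        printf("mirror address: 0x%llx\n", (unsigned long long)mpa);   /* 0x9abc */
    return 0;
}
```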
步骤340、管理节点120向节点110A反馈镜像成功响应。
管理节点120根据内存镜像需求分配与待镜像区域的大小相同的镜像区域后,向节点110A反馈镜像成功响应。节点110A可以生成待镜像区域的镜像标识,镜像标识指示该待镜像区域是一个已镜像成功的区域,是一个克隆体。节点110A还可以生成待镜像区域的虚拟地址(Virtual Address,VA)和待镜像区域的物理地址(Physical Address,PA)的映射关系,以便于节点110A根据待镜像区域的虚拟地址确定待镜像区域的物理地址,对待镜像区域进行读操作或写操作。
进一步地,内存镜像配置完成后,系统中的业务执行完成、删除虚拟机、删除容器等无需备份高可靠性的数据时,可以释放内存镜像的存储资源。本申请还包括步骤350。
步骤350、管理节点120向节点110A和节点110B发送内存镜像释放指示。
在一些实施例中,管理节点120可以接收节点110A的内存镜像释放请求,内存镜像释放请求指示请求释放的待镜像区域,例如,内存镜像释放请求包括待镜像区域的物理地址和待镜像区域的大小。又如,内存镜像释放请求包括待镜像区域的物理地址段。又如内存镜像释放请求包括待镜像区域的物理地址和偏移地址。
在另一些实施例中,管理节点120确定在监控时段内节点110A的待镜像区域和节点110B的镜像区域未被使用,管理节点120确定释放节点110A的待镜像区域和节点110B的镜像区域,使待镜像区域和镜像区域可以用于存储其他数据,以提高存储资源的利用率。
管理节点120向节点110A发送的第一内存镜像释放指示,第一内存镜像释放指示包括待镜像区域的物理地址。管理节点120向节点110B发送的第二内存镜像释放指示,第二内存镜像释放指示包括镜像区域的物理地址。
节点110A根据第一内存镜像释放指示释放待镜像区域,或者将待镜像区域的镜像标识修改为无效。节点110B根据第二内存镜像释放指示释放镜像区域,或者将镜像区域的镜像标识修改为无效。
如此,本申请提供的内存镜像方法不依赖于节点的操作系统,由管理节点依据内存镜像需求动态分配镜像区域实现内存镜像,无需重启配置内存镜像的主机;在无需内存镜像时动态释放内存镜像的存储资源,从而,实现更简单、更高效的动态内存镜像,提升存储资源的利用率。
在内存镜像配置完成后,采用完全复制的方式向互为镜像的物理存储空间进行写操作,实现内存镜像的效果。图4为本申请提供的一种数据处理方法的流程示意图。在这里以节点110A对待镜像区域进行写操作和读操作为例进行说明。如图4所示,该方法包括以下步骤。
步骤410、节点110A向管理节点120发送写指示。
写指示用于指示将第一数据存储到待镜像区域。例如,节点110A根据待镜像区域的虚拟地址查询地址映射表确定待镜像区域的物理地址,写指示包括待镜像区域的物理地址。地址映射表指示了虚拟地址和物理地址的映射关系。
步骤420、管理节点120将第一数据写入待镜像区域和镜像区域。
管理节点120获取到写指示后,根据写指示包括的待镜像区域的物理地址,将第一数据写入待镜像区域。
在一些实施例中,管理节点120支持CXL3.0、p2p模式等缓存一致性协议,管理节点120将第一数据写入镜像区域。
在另一些实施例中,管理节点120根据待镜像区域的物理地址查询镜像关系,确定镜像区域的物理地址,根据镜像区域的物理地址将第一数据写入镜像区域。
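A hedged C sketch of the write path in step 420: the first data is written to the region to be mirrored and, if a mirror relationship exists for that address, the same data is replicated to the mirror region. The simulated physical memory, the hard-coded mirror mapping and the helper names are assumptions standing in for whatever mechanism actually reaches the target medium.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simulated physical memory; a real management node would reach the owning
 * medium over the interconnect. */
static uint8_t phys_mem[0x4000];

static int mem_write(uint64_t pa, const void *buf, uint64_t len) {
    if (pa + len > sizeof(phys_mem)) return -1;
    memcpy(&phys_mem[pa], buf, len);
    return 0;
}

/* One hard-coded mirror relationship: [0x1000, 0x2000) is mirrored at 0x3000. */
static int mirror_lookup(uint64_t pa, uint64_t *mirror_pa) {
    if (pa >= 0x1000 && pa < 0x2000) { *mirror_pa = pa - 0x1000 + 0x3000; return 0; }
    return -1;
}

/* Step 420: write the data to the region to be mirrored, then replicate it. */
static int mirrored_write(uint64_t pa, const void *buf, uint64_t len) {
    int rc = mem_write(pa, buf, len);            /* primary copy */
    if (rc != 0) return rc;
    uint64_t mpa;
    if (mirror_lookup(pa, &mpa) == 0)
        rc = mem_write(mpa, buf, len);           /* backup copy in the mirror region */
    return rc;
}

int main(void) {
    const char data[] = "first data";
    mirrored_write(0x1200, data, sizeof(data));
    printf("primary: %s, mirror: %s\n",
           (char *)&phys_mem[0x1200], (char *)&phys_mem[0x3200]);
    return 0;
}
```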
步骤430、节点110A向管理节点120发送读指示。
读指示用于指示从待镜像区域读取第一数据。例如,节点110A根据待镜像区域的虚拟地址查询地址映射表确定待镜像区域的物理地址,读指示包括待镜像区域的物理地址。
当待镜像区域未发生不可纠正错误时,执行步骤440。当待镜像区域发生不可纠正错误时,执行步骤450。
步骤440、管理节点120从待镜像区域读取第一数据。管理节点120将第一数据反馈给节点110A。
步骤450、管理节点120从镜像区域读取第一数据。
管理节点120确定待镜像区域发生不可纠正错误,根据待镜像区域的物理地址查询镜像关系,确定待镜像区域的镜像区域的物理地址,根据镜像区域的物理地址从镜像区域读取第一数据。
管理节点120从待镜像区域读取第一数据或从镜像区域读取第一数据后,将第一数据反馈给节点110A。
在一些实施例中,节点110A从待镜像区域读取到数据后,对从待镜像区域读取的数据进行校验,确定读取的数据发生错误,如从待镜像区域读取的数据不是第一数据。节点110A采用ECC技术无法对读取到的错误数据进行纠错,则指示管理节点120从镜像区域读取第一数据,即执行步骤450。
管理节点120支持CXL3.0、p2p模式等缓存一致性协议,从镜像区域读取第一数据后,将第一数据写入待镜像区域。
管理节点120不支持CXL3.0、p2p模式等缓存一致性协议,管理节点120将从镜像区域读取的第一数据反馈给节点110A,节点110A请求管理节点120将第一数据写入待镜像区域。
如果将第一数据成功写入待镜像区域,表示待镜像区域未发生硬件故障,可能是偶然性数据错误。如果将第一数据写入待镜像区域失败,表示待镜像区域发生硬件故障,启动待镜像区域和镜像区域进行主备倒换。
进一步地,当待镜像区域发生不可纠正错误时,管理节点120可以对待镜像区域和镜像区域进行主备倒换。例如,管理节点120将镜像区域确定为主存储空间。从而,使得节点110对第一数据进行读操作或写操作。
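A hedged C sketch of the read path in steps 440 and 450 together with the primary/backup switch described above: read from the region to be mirrored, fall back to the mirror region on an uncorrectable error, attempt to repair the primary copy, and promote the mirror region to primary storage if the repair write fails. The simulated memory, the error flags and all helper names are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t phys_mem[0x4000];
static int primary_has_uce   = 1;   /* simulate an uncorrectable error in the primary region */
static int primary_is_faulty = 0;   /* simulate a hardware fault blocking the repair write */

static int mem_read(uint64_t pa, void *buf, uint64_t len) {
    if (pa < 0x2000 && primary_has_uce) return -1;      /* UCE in the primary region */
    memcpy(buf, &phys_mem[pa], len);
    return 0;
}

static int mem_write(uint64_t pa, const void *buf, uint64_t len) {
    if (pa < 0x2000 && primary_is_faulty) return -1;
    memcpy(&phys_mem[pa], buf, len);
    return 0;
}

/* [0x1000, 0x2000) is mirrored at 0x3000 (same mapping as the earlier sketch). */
static int mirror_lookup(uint64_t pa, uint64_t *mirror_pa) {
    if (pa >= 0x1000 && pa < 0x2000) { *mirror_pa = pa - 0x1000 + 0x3000; return 0; }
    return -1;
}

static void promote_mirror_to_primary(uint64_t pa, uint64_t mirror_pa) {
    printf("primary/backup switch: 0x%llx now served from 0x%llx\n",
           (unsigned long long)pa, (unsigned long long)mirror_pa);
}

/* Steps 440/450: read the primary copy, fall back to the mirror on a UCE,
 * repair the primary copy, and fail over if the repair write does not succeed. */
static int mirrored_read(uint64_t pa, void *buf, uint64_t len) {
    if (mem_read(pa, buf, len) == 0) return 0;           /* step 440 */
    uint64_t mpa;
    if (mirror_lookup(pa, &mpa) != 0) return -1;          /* no mirror region: data lost */
    if (mem_read(mpa, buf, len) != 0) return -1;          /* step 450: read the backup copy */
    if (mem_write(pa, buf, len) != 0)                     /* repair attempt */
        promote_mirror_to_primary(pa, mpa);
    return 0;
}

int main(void) {
    memcpy(&phys_mem[0x3200], "first data", 11);          /* backup copy already present */
    char buf[11];
    if (mirrored_read(0x1200, buf, sizeof(buf)) == 0)
        printf("read back: %s\n", buf);
    return 0;
}
```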
可以理解的是,为了实现上述实施例中的功能,管理节点包括了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本申请中所公开的实施例描述的各示例的单元及方法步骤,本申请能够以硬件或硬件和计算机软件相结合的形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用场景和设计约束条件。
上文中结合图1至图4,详细描述了根据本实施例所提供的内存镜像方法,下面将结合图5,描述根据本实施例所提供的管理装置和节点。
图5为本实施例提供的可能的管理装置的结构示意图。这些管理装置可以用于实现上述方法实施例中管理节点的功能,因此也能实现上述方法实施例所具备的有益效果。在本实施例中,该管理装置可以是如图3或图4所示的管理节点120,还可以是应用于服务器的模块(如芯片)。
如图5所示,管理装置500包括通信模块510、控制模块520和存储模块530。管理装置500用于实现上述图3或图4中所示的方法实施例中管理节点120的功能。
通信模块510用于接收第一节点的内存镜像需求,请求对所述第一节点所使用的内存中第一区域进行镜像。例如,通信模块510用于执行图3中步骤320。
控制模块520,用于当所述第一节点请求对所述第一节点所使用的内存中第一区域进行镜像时,分配第二区域,所述第二区域为所述第一区域的镜像区域,所述第二区域用于指示第二节点中与所述第一区域的大小相同的存储空间,所述第二区域用于备份存储所述第一区域的数据。例如,控制模块520用于执行图3中步骤330。
控制模块520,还用于生成所述第一区域和所述第二区域的镜像关系,所述镜像关系用于指示所述第一物理地址与第二物理地址的对应关系,所述第二物理地址用于指示所述第二区域。
通信模块510,还用于接收对第一区域进行写操作或读操作。例如,通信模块510用于执行图3中步骤340。例如,通信模块510用于执行图4中步骤420、步骤440和步骤450。
控制模块520,还用于根据镜像关系对第一区域和第二区域进行写操作或读操作。
通信模块510,还用于向节点反馈镜像成功。例如,通信模块510用于执行图3中步骤340。
通信模块510,还用于向节点发送内存镜像释放请求。例如,通信模块510用于执行图3中步骤350。
存储模块530用于存储镜像关系,以便于控制模块520根据镜像关系访问镜像区域。
应理解的是,本申请实施例的管理装置500可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。当通过软件实现图3或图4所示的内存镜像方法时,管理装置500及其各个模块也可以为软件模块。
根据本申请实施例的管理装置500可对应于执行本申请实施例中描述的方法,并且管理装置500中的各个单元的上述和其它操作和/或功能分别为了实现图3或图4中的各个方法的相应流程,为了简洁,在此不再赘述。
图6为本实施例提供的一种计算设备600的结构示意图。如图所示,计算设备600包括处理器610、总线620、存储器630、通信接口640和内存单元650(也可以称为主存(main memory)单元)。处理器610、存储器630、内存单元650和通信接口640通过总线620相连。
应理解,在本实施例中,处理器610可以是CPU,该处理器610还可以是其他通用处理器、数字信号处理器(digital signal processing,DSP)、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。
处理器还可以是图形处理器(graphics processing unit,GPU)、神经网络处理器(neural network processing unit,NPU)、微处理器、ASIC、或一个或多个用于控制本申请方案程序执行的集成电路。
通信接口640用于实现计算设备600与外部设备或器件的通信。在本实施例中,计算设备600用于实现图1所示的管理节点120的功能时,通信接口640用于获取内存镜像需求,处理器610分配镜像区域。计算设备600用于实现图1所示的节点110的功能时,通信接口640用于发送内存镜像需求。
总线620可以包括一通路,用于在上述组件(如处理器610、内存单元650和存储器630)之间传送信息。总线620除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线620。总线620可以是快捷外围部件互连标准(Peripheral Component Interconnect Express,PCIe)总线,或扩展工业标准结构(extended industry standard architecture,EISA)总线、计算机快速链接(compute express link,CXL)、缓存一致互联协议(cache coherent interconnect for accelerators,CCIX)等。总线620可以分为地址总线、数据总线、控制总线等。
作为一个示例,计算设备600可以包括多个处理器。处理器可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的计算单元。在本实施例中,计算设备600用于实现图3所示的管理节点120的功能时,处理器610还用于当所述第一节点请求对所述第一节点所使用的内存中第一区域进行镜像时,分配第二区域,所述第二区域为所述第一区域的镜像区域,所述第二区域用于指示第二节点中与所述第一区域的大小相同的存储空间,所述第二区域用于备份存储所述第一区域的数据。
计算设备600用于实现图4所示的节点110的功能时,处理器610还用于请求对已申请镜像的区域进行写操作或读操作。
计算设备600用于实现图4所示的管理节点120的功能时,处理器610还用于根据镜像关系对镜像区域进行写操作或读操作。
值得说明的是,图6中仅以计算设备600包括1个处理器610和1个存储器630为例,此处,处理器610和存储器630分别用于指示一类器件或设备,具体实施例中,可以根据业务需求确定每种类型的器件或设备的数量。
内存单元650可以对应上述方法实施例中用于存储镜像关系。内存单元650可以是易失性存储器池或非易失性存储器池,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data date SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
存储器630可以对应上述方法实施例中用于存储计算机指令、内存操作指令、节点标识等信息的存储介质,例如,磁盘,如机械硬盘或固态硬盘。
上述计算设备600可以是一个通用设备或者是一个专用设备。例如,计算设备600可以是边缘设备(例如,携带具有处理能力芯片的盒子)等。可选地,计算设备600也可以是服务器或其他具有计算能力的设备。
应理解,根据本实施例的计算设备600可对应于本实施例中的管理装置500,并可以对应于执行根据图3或图4中任一方法中的相应主体,并且管理装置500中的各个模块的上述和其它操作和/或功能分别为了实现图3或图4中的各个方法的相应流程,为了简洁,在此不再赘述。
本实施例中的方法步骤可以通过硬件的方式来实现,也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于计算设备中。当然,处理器和存储介质也可以作为分立组件存在于计算设备中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序或指令。在计算机上加载和执行所述计算机程序或指令时,全部或部分地执行本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、用户设备或者其它可编程装置。所述计算机程序或指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机程序或指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是集成一个或多个可用介质的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,例如,软盘、硬盘、磁带;也可以是光介质,例如,数字视频光盘(digital video disc,DVD);还可以是半导体介质,例如,固态硬盘(solid state drive,SSD)。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (23)

  1. 一种数据处理系统,其特征在于,所述数据处理系统包括多个节点和管理节点;
    第一节点,用于请求对所述第一节点所使用的内存中第一区域进行镜像;
    所述管理节点,用于分配第二区域,所述第二区域为所述第一区域的镜像区域,所述第二区域用于指示第二节点中与所述第一区域的大小相同的存储空间,所述第二区域用于备份存储所述第一区域的数据。
  2. 根据权利要求1所述的系统,其特征在于,所述第一节点指示了所述第一区域的第一物理地址;
    所述管理节点,还用于生成所述第一区域和所述第二区域的镜像关系,所述镜像关系用于指示所述第一物理地址与第二物理地址的对应关系,所述第二物理地址用于指示所述第二区域。
  3. 根据权利要求1或2所述的系统,其特征在于,
    所述管理节点,还用于接收所述第一节点发送的写指示,所述写指示用于指示将第一数据存储到所述第一区域;
    所述管理节点,还用于将所述第一数据写入所述第一区域和所述第二区域。
  4. 根据权利要求3所述的系统,其特征在于,
    所述管理节点,还用于接收所述第一节点的读指示,所述读指示用于指示从所述第一区域读取所述第一数据;
    所述管理节点,还用于当所述第一区域未发生不可纠正错误时,从所述第一区域读取所述第一数据。
  5. 根据权利要求4所述的系统,其特征在于,
    所述管理节点,还用于当所述第一区域发生不可纠正错误时,从所述第二区域读取所述第一数据。
  6. 根据权利要求5所述的系统,其特征在于,所述第一区域为主存储空间,所述第二区域为备存储空间;
    所述管理节点,还用于当所述第一区域发生不可纠正错误时,将所述第二区域确定为主存储空间。
  7. 根据权利要求1-6中任一项所述的系统,其特征在于,
    所述管理节点,还用于指示所述第一节点将所述第一区域的镜像标识修改为无效。
  8. 根据权利要求1-7中任一项所述的系统,其特征在于,所述第一区域的大小是由应用需求确定的。
  9. 根据权利要求1-8中任一项所述的系统,其特征在于,所述第二区域包括所述第二节点的本地存储空间、所述第二节点的扩展存储空间和全局内存池中所述第二节点的存储空间中任一种。
  10. 根据权利要求1-9中任一项所述的系统,其特征在于,所述管理节点支持缓存一致性协议。
  11. 一种内存镜像方法,其特征在于,数据处理系统包括多个节点和管理节点;所述方法包括:
    第一节点请求对所述第一节点所使用的内存中第一区域进行镜像;
    所述管理节点分配第二区域,所述第二区域为所述第一区域的镜像区域,所述第二区域用于指示第二节点中与所述第一区域的大小相同的存储空间,所述第二区域用于备份存储所述第一区域的数据。
  12. 根据权利要求11所述的方法,其特征在于,所述第一节点指示了所述第一区域的第一物理地址;所述方法还包括:
    所述管理节点生成所述第一区域和所述第二区域的镜像关系,所述镜像关系用于指示所述第一物理地址与第二物理地址的对应关系,所述第二物理地址用于指示所述第二区域。
  13. 根据权利要求11或12所述的方法,其特征在于,所述方法还包括:
    所述管理节点接收所述第一节点发送的写指示,所述写指示用于指示将第一数据存储到所述第一区域;
    所述管理节点将所述第一数据写入所述第一区域和所述第二区域。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    所述管理节点接收所述第一节点的读指示,所述读指示用于指示从所述第一区域读取所述第一数据;
    所述管理节点当所述第一区域未发生不可纠正错误时,从所述第一区域读取所述第一数据。
  15. 根据权利要求14所述的方法,其特征在于,所述方法还包括:
    所述管理节点当所述第一区域发生不可纠正错误时,从所述第二区域读取所述第一数据。
  16. 根据权利要求15所述的方法,其特征在于,所述第一区域为主存储空间,所述第二区域为备存储空间,所述方法还包括:
    所述管理节点当所述第一区域发生不可纠正错误时,将所述第二区域确定为主存储空间。
  17. 根据权利要求11-16中任一项所述的方法,其特征在于,所述方法还包括:
    所述管理节点指示所述第一节点将所述第一区域的镜像标识修改为无效。
  18. 根据权利要求11-17中任一项所述的方法,其特征在于,所述第一区域的大小是由应用需求确定的。
  19. 根据权利要求11-18中任一项所述的方法,其特征在于,所述第二区域包括所述第二节点的本地存储空间、所述第二节点的扩展存储空间和全局内存池中所述第二节点的存储空间中任一种。
  20. 根据权利要求11-19中任一项所述的方法,其特征在于,所述管理节点支持缓存一致性协议。
  21. 一种管理装置,其特征在于,所述管理装置应用于数据处理系统,所述数据处理系统包括多个节点,所述多个节点包括第一节点和第二节点,所述装置包括:
    控制模块,用于当所述第一节点请求对所述第一节点所使用的内存中第一区域进行镜像时,分配第二区域,所述第二区域为所述第一区域的镜像区域,所述第二区域用于指示第二节点中与所述第一区域的大小相同的存储空间,所述第二区域用于备份存储所述第一区域的数据。
  22. 根据权利要求21所述的装置,其特征在于,所述第一节点指示了所述第一区域的第一物理地址;
    所述控制模块,还用于生成所述第一区域和所述第二区域的镜像关系,所述镜像关系用于指示所述第一物理地址与第二物理地址的对应关系,所述第二物理地址用于指示所述第二区域。
  23. 一种计算设备,其特征在于,所述计算设备包括存储器和至少一个处理器,所述存储器用于存储一组计算机指令;当所述处理器执行所述一组计算机指令时,所述计算设备执行如权利要求11-20中任一所述的方法。
PCT/CN2023/102963 2022-09-09 2023-06-27 数据处理系统、内存镜像方法、装置和计算设备 WO2024051292A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211105202 2022-09-09
CN202211105202.3 2022-09-09
CN202211519995.3A CN117687835A (zh) 2022-09-09 2022-11-30 数据处理系统、内存镜像方法、装置和计算设备
CN202211519995.3 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024051292A1 true WO2024051292A1 (zh) 2024-03-14

Family

ID=90127199

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/102963 WO2024051292A1 (zh) 2022-09-09 2023-06-27 数据处理系统、内存镜像方法、装置和计算设备

Country Status (2)

Country Link
CN (1) CN117687835A (zh)
WO (1) WO2024051292A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10037371B1 (en) * 2014-07-17 2018-07-31 EMC IP Holding Company LLC Cumulative backups
CN112631822A (zh) * 2019-10-07 2021-04-09 三星电子株式会社 存储器、具有其的存储系统及其操作方法
CN113282342A (zh) * 2021-05-14 2021-08-20 北京首都在线科技股份有限公司 部署方法、装置、系统、电子设备和可读存储介质

Also Published As

Publication number Publication date
CN117687835A (zh) 2024-03-12
