CN117950921A - Memory fault processing method, memory expansion control device, electronic device and medium

Info

Publication number
CN117950921A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410326646.2A
Other languages
Chinese (zh)
Inventor
俞引挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Information Technologies Co Ltd
Original Assignee
New H3C Information Technologies Co Ltd
Application filed by New H3C Information Technologies Co Ltd

Abstract

Embodiments of the application provide a memory fault processing method, a memory expansion control device, an electronic device and a medium. In these embodiments, among the N blocks of memory provided to an external designated device as extended memory, at least one is configured as a main memory and the rest are configured as mirror memories; any main memory corresponds to at least one mirror memory. A mirror memory serves as a backup of its main memory, and during a write the data to be written is synchronously written into each memory that has no memory fault. If the main memory holding the data to be read has a memory fault during a read, the data can be read from any non-faulty mirror memory corresponding to that main memory. This avoids device downtime caused by a main memory fault and effectively reduces the probability of device downtime caused by extended memory faults.

Description

Memory fault processing method, memory expansion control device, electronic device and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a memory failure processing method, a memory expansion control device, an electronic device, and a medium.
Background
With the continuous development of technology, electronic devices demand ever more memory, and a memory expansion controller is needed to expand the memory of an electronic device. For example, a Compute Express Link (CXL) memory expansion controller typically carries a large amount of expansion memory that can be used together with the host memory of the electronic device. However, in practical applications, if the expansion memory fails, the electronic device may go down, which affects its operation.
Disclosure of Invention
In view of this, the application provides a memory fault processing method, a memory expansion control device, an electronic device and a medium, so as to reduce the occurrence probability of equipment downtime caused by expansion memory faults.
The embodiment of the application provides a memory fault processing method, which is applied to memory expansion control equipment, wherein the memory expansion control equipment provides N blocks of memories to serve as expansion memories of external designated equipment connected with the memory expansion control equipment, and N is larger than 1; at least one of the N blocks of memory is used as a main memory, and the rest is used as a mirror memory; any main memory corresponds to at least one mirror memory; the method comprises the following steps:
When a data writing event is detected, the data writing event indicates a first main memory to which data to be written by the data writing event are written and a first storage position of the data to be written in the first main memory, wherein the first main memory corresponds to at least one first mirror memory; for a first main memory and non-fault memories in each first mirror memory corresponding to the first main memory, synchronously writing data to be written into each non-fault memory according to a first storage position;
After the data writing is completed, feeding back the first message to an operating system of the external designated device through a central processing unit (Central Processing Unit, CPU) of the external designated device; the first message includes at least: a data identifier of the data to be written, and a memory identifier of each non-faulty memory to which the data to be written has been written; after receiving the first message, the operating system updates the recorded memory mapping table based on the first message; the memory mapping table indicates the mapping relation between the data identification of the stored data in the memory resource and the memory identification corresponding to the stored memory and the position identification of the storage position of the stored memory;
When a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading data to be read by a data reading event from any second mirror image memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
The embodiment of the application also provides a memory expansion control device, which provides N blocks of memory to serve as an expansion memory of an external designated device connected with the memory expansion control device, wherein N is larger than 1; at least one of the N blocks of memory is used as a main memory, and the rest is used as a mirror memory; any main memory corresponds to at least one mirror memory; the memory expansion control device comprises a memory controller;
The memory controller is configured to: when a data writing event is detected, where the data writing event indicates a first main memory to which the data to be written by the data writing event is to be written and a first storage position of the data to be written in the first main memory, and the first main memory corresponds to at least one first mirror memory, synchronously write the data to be written, according to the first storage position, into each non-faulty memory among the first main memory and the first mirror memories corresponding to the first main memory;
After the data writing is completed, feeding back the first message to an operating system of the external designated equipment through a Central Processing Unit (CPU) of the external designated equipment; the first message includes at least: a data identifier of the data to be written, and a memory identifier of each non-faulty memory to which the data to be written has been written; after receiving the first message, the operating system updates the recorded memory mapping table based on the first message; the memory mapping table indicates the mapping relation between the data identification of the stored data in the memory resource and the memory identification corresponding to the stored memory and the position identification of the storage position of the stored memory;
When a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading data to be read by a data reading event from any second mirror image memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
The embodiment of the application also provides electronic equipment, which comprises:
a processor and a memory for storing computer program instructions which, when executed by the processor, cause the processor to perform the steps of the method as above.
Embodiments of the present application also provide a computer readable storage medium storing computer program instructions which, when executed, enable the steps of the method as above to be carried out.
As can be seen from the above technical solutions, in this embodiment, among the N blocks of memory provided to an external designated device as extended memory, at least one is configured as a main memory and the rest are configured as mirror memories; any main memory corresponds to at least one mirror memory. A mirror memory serves as a backup of its main memory, and during a write the data to be written is synchronously written into each memory that has no memory fault. If the main memory holding the data to be read has a memory fault during a read, the data can be read from any non-faulty mirror memory corresponding to that main memory. This avoids device downtime caused by a main memory fault and effectively reduces the probability of device downtime caused by extended memory faults. Further, after the data writing is completed, this embodiment feeds back to the operating system of the external designated device the first message, which indicates the memory identifier of each non-faulty memory into which the data has been written. The operating system can then update the recorded memory mapping table based on the first message, so that it knows which memories hold the data and can determine, for a subsequent read request, the memory in which the data to be read is located.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of implementing CXL memory expansion provided in an embodiment of the present application.
Fig. 2 is a flow chart of a memory failure processing method according to an embodiment of the present application.
Fig. 3 is a schematic implementation diagram of a CXL memory expansion controller according to an embodiment of the present application.
Fig. 4 is a schematic diagram of another implementation of a CXL memory expansion controller according to an embodiment of the present application.
Fig. 5 is a schematic diagram of an implementation of extended memory fault handling according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a memory expansion control device according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In order to facilitate understanding of the technical solution of the embodiment of the present application, the technical concept related to the embodiment of the present application is explained below.
CXL: an interconnect designed for memory expansion, heterogeneous computing, and system resource disaggregation.
CXL memory expansion controller chip: a Compute Express Link™ (CXL™) DRAM memory controller, which belongs to the Type 3 device class defined by the CXL protocol. The chip supports the JEDEC DDR4 and DDR5 standards, complies with the CXL 2.0 standard, and supports PCIe 5.0 rates. As shown in fig. 1, the chip can provide a high-bandwidth, low-latency interconnection solution between a CPU and devices based on the CXL protocol (such as a plug-in memory expansion card), thereby realizing memory sharing between the CPU and each CXL device, greatly improving system performance while significantly reducing software stack complexity and the total cost of ownership (TCO) of a data center.
DDR memory: double data rate synchronous dynamic random access memory (DDR SDRAM), a type of computer memory.
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flow chart of a memory failure processing method according to an embodiment of the present application. The method can be applied to a memory expansion control device, where the memory expansion control device provides N blocks of memory to serve as expansion memory of an external designated device connected to the memory expansion control device, and N is greater than 1; at least one of the N blocks of memory is used as a main memory, and the rest are used as mirror memories; any main memory corresponds to at least one mirror memory. Here, the memory expansion control device may be, for example, a CXL memory expansion control device. The external designated device may be an electronic device such as a server.
As shown in fig. 2, the process may include the steps of:
S201, when a data writing event is detected, the data writing event indicates a first main memory to which data to be written by the data writing event is written and a first storage position of the data to be written in the first main memory, wherein the first main memory corresponds to at least one first mirror memory; and synchronously writing the data to be written into the non-fault memories according to the first storage position aiming at the first main memory and the non-fault memories in the first mirror memories corresponding to the first main memory.
In this embodiment, the storage location may refer to a storage address for storing data. Non-failed memory may be considered extended memory without memory failure.
As one embodiment, the memory expansion control device periodically detects whether each memory in the N blocks of memory has a memory failure. When the data writing event is detected, if the first main memory to which the data to be written is written and the first mirror memories corresponding to the first main memory are found to have no memory faults, the data to be written by the data writing event can be synchronously written into the first main memory and the corresponding first mirror memories. If the first main memory and the corresponding first mirror memories are found to have the fault memory, the data writing operation is not executed aiming at the fault memory, and the data to be written can be synchronously written into the first main memory and the corresponding non-fault memory in the first mirror memories. The mirror memory is equivalent to the backup of the main memory, and the data to be written is synchronously written into the main memory and the mirror memory, so that the data to be read can be read from the corresponding mirror memory if the main memory fails during the subsequent data reading, thereby avoiding equipment downtime caused by the expansion memory failure.
As an embodiment, the method further comprises: in the process of synchronously writing the data to be written into each non-fault memory, if any non-fault memory is found to have a memory fault, stopping the continuous writing of the data to be written.
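The write path described above can be illustrated with a minimal sketch. It assumes a simplified software model in which each memory is a `MemoryModule` object holding an address-to-byte dictionary and a `faulty` flag; these names and the `mirrored_write` function are hypothetical and are not taken from the application, which describes controller hardware rather than software.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryModule:
    """Hypothetical model of one block of extended memory."""
    mem_id: int
    faulty: bool = False
    cells: dict = field(default_factory=dict)  # address -> byte value


def mirrored_write(main, mirrors, address, data):
    """Write data at the same address into every non-faulty memory of the group.

    Returns the identifiers of the memories that were written, which is the
    information the first message would carry back to the operating system.
    """
    targets = [m for m in [main, *mirrors] if not m.faulty]
    for offset, byte in enumerate(data):
        # Stop writing if a memory that was healthy at the start has failed.
        if any(m.faulty for m in targets):
            raise RuntimeError("memory fault during write; write stopped")
        for m in targets:
            m.cells[address + offset] = byte
    return [m.mem_id for m in targets]
```

In this sketch a faulty mirror is simply skipped, so the write still succeeds as long as at least one memory of the group is healthy.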
Optionally, as an embodiment, there are many specific ways to detect whether each of the N blocks of memory has a memory failure. For example, an uncorrectable error (Uncorrectable Error, UCE) mechanism may be used to detect whether a memory has a memory failure, or other memory failure detection methods may be used, which are not limited herein. Optionally, a memory failure includes at least: failure of the memory itself, and failure of the memory slot in which the memory is installed.
Optionally, as an embodiment, after a memory failure is detected, alarm information indicating that the memory has a memory failure may be fed back to a target user (such as a technician responsible for monitoring and handling memory failures), so that the target user can handle the failed memory in time and restore it to a usable state. For example, if the alarm information indicates that the memory itself has failed (or that the memory slot in which the memory is installed has failed), the target user may resolve the failure by replacing the memory module with a new one (or repairing the memory slot), so that the memory is restored to a usable state.
Optionally, as an embodiment, there are many specific ways to detect a data writing event. For example, when a data writing message sent by the CPU is received, it is determined that a data writing event is detected. The data writing message carries at least: the data to be written, a memory identifier indicating the first main memory to which the data to be written is to be written, and a location identifier indicating the first storage location of the data to be written in the first main memory. The memory identifier may be, for example, a memory number. The location identifier may be, for example, a memory storage address.
The first main memory and the first storage location are determined by the operating system based on the set memory management mechanism, the recorded configuration information of the memory resources, and the data to be written; the memory resources include at least the local memory of the external designated device and the N blocks of memory; the configuration information includes at least the memory identifier of each memory in the memory resources, the available state of each memory, and the main memory and mirror memory configuration associated with the N blocks of memory. Here, the memory management mechanism refers to a preconfigured mechanism for memory allocation and management; its specific content is not limited herein and can be set flexibly according to actual requirements. The available state indicates whether a memory is available; for example, a failed memory is marked as unavailable. The main memory and mirror memory configuration associated with the N blocks of memory can be understood as the main/mirror configuration of the N blocks of memory provided by the memory expansion control device, for example, which memories are main memories and, for any main memory, which memories are its mirror memories.
As an embodiment, when an application program or a process of the external designated device (such as a server) needs memory to write data, it initiates a memory application request to the operating system. After receiving the request, the operating system can determine memory allocation information, such as the memory into which the data is to be written and the storage location within that memory, based on the set memory management mechanism, the recorded configuration information and the data to be written, and pass the determined memory allocation information to the CPU. The CPU forms a data writing message based on the determined memory allocation information and the data to be written and, if the memory to be written is an expansion memory, sends the data writing message to the memory expansion control device.
S202, after data writing is completed, feeding back a first message to an operating system of the external designated equipment through a CPU of the external designated equipment; the first message includes at least: a data identification of the data to be written, and a memory identification of each non-failed memory to which the data to be written has been written.
In this embodiment, the data identifier may be, for example, a virtual address; it is not specifically limited herein, provided that it can identify the data. A virtual address here refers to an abstract address used by an application or process. After receiving the first message, the operating system updates the recorded memory mapping table based on the first message; the memory mapping table indicates the mapping between the data identifier of data stored in the memory resources, the memory identifier of each memory in which the data is stored, and the location identifier of the storage location of the data in that memory.
For example, after receiving the first message, the operating system may map the data identifier and the memory identifier of each non-faulty memory to the recorded memory mapping table based on the data identifier of the data to be written in the first message and the memory identifier of each non-faulty memory to which the data to be written is written.
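A minimal sketch of this update on the operating system side might look as follows. The dictionary layout of the mapping table, the field names of the first message, and the `update_mapping_table` function are illustrative assumptions; the application does not specify a concrete format. The storage location is passed in separately on the assumption that the operating system already knows it from its own allocation decision.

```python
def update_mapping_table(mapping_table, first_message, location):
    """Record which memories hold the data named by the first message.

    mapping_table: dict mapping data identifier -> mapping entry (assumed layout)
    first_message: dict with "data_id" and "memory_ids" (assumed field names)
    location: storage location chosen by the operating system at allocation time
    """
    mapping_table[first_message["data_id"]] = {
        "memory_ids": list(first_message["memory_ids"]),
        "location": location,
    }
    return mapping_table


# Usage: data identified by virtual address 0x7F00 was written to memories 0 and 2.
table = update_mapping_table({}, {"data_id": 0x7F00, "memory_ids": [0, 2]}, 0x100)
```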
S203, when a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading data to be read by a data reading event from any second mirror image memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
As an embodiment, there are many specific ways to detect a data reading event. For example, when a data request message sent by the CPU is received, it is determined that a data reading event is detected. The data request message carries at least: a memory identifier indicating the second main memory in which the requested data is located, and a location identifier indicating the second storage location of the requested data in the second main memory. The second main memory and the second storage location are determined by the operating system based on the recorded memory mapping table and the data identifier of the requested data.
For example, when an application program or a process of the external designated device (such as a server) needs to request data, it initiates a data request to the CPU. The CPU notifies the operating system to determine, based on the data identifier of the requested data and the recorded memory mapping table, information such as the memory in which the requested data is located and the storage location of the data in that memory, and the determined information is passed to the CPU. The CPU forms a data request message based on the information determined by the operating system and, if the memory in which the requested data is located is an extended memory, sends the data request message to the memory expansion control device. The data request message carries at least: a memory identifier indicating the main memory in which the requested data is located, and a location identifier indicating the storage location of the requested data in the main memory.
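The host-side lookup described above can be sketched as follows. The mapping table layout, the `main_ids` set (standing in for the main/mirror configuration recorded by the operating system), and the `build_read_request` function are all hypothetical names introduced for illustration.

```python
def build_read_request(mapping_table, main_ids, data_id):
    """Resolve a data identifier into the fields of a data request message.

    mapping_table: dict mapping data identifier -> {"memory_ids", "location"}
    main_ids: set of memory identifiers currently configured as main memory
    """
    entry = mapping_table[data_id]
    # Pick the memory holding the data that is currently configured as main.
    main_id = next(m for m in entry["memory_ids"] if m in main_ids)
    return {"memory_id": main_id, "location": entry["location"]}
```

If no memory holding the data is currently configured as main, `next` raises `StopIteration` in this sketch; a real system would need an explicit policy for that case.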
As one embodiment, after the data reading is completed, a second message is fed back to the CPU, where the second message includes at least: data to be read. After receiving the data to be read, the CPU feeds back the read data to the application or process requesting the data.
Optionally, as an embodiment, the method further includes: in the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, re-reading the data to be read by the data reading event from any second mirror memory without the memory fault; or in the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, continuing to read the unread data in the data to be read by the data reading event from any second mirror image memory which has no memory fault.
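The read path with failover, including the variant that continues reading only the unread remainder from a mirror, can be sketched as below. `MemoryModule`, `MemoryFault`, the `fail_at` simulation hook and `read_with_failover` are hypothetical names for a simplified software model, not part of the application.

```python
from dataclasses import dataclass, field
from typing import Optional


class MemoryFault(Exception):
    """Raised when a read touches a faulty memory."""


@dataclass
class MemoryModule:
    mem_id: int
    faulty: bool = False
    cells: dict = field(default_factory=dict)  # address -> byte value
    fail_at: Optional[int] = None  # simulation hook: fail when this address is read


def read_byte(module, address):
    if module.fail_at == address:
        module.faulty = True  # simulate a fault appearing in the middle of a read
    if module.faulty:
        raise MemoryFault(module.mem_id)
    return module.cells[address]


def read_with_failover(main, mirrors, address, length):
    """Read from the main memory, continuing from a mirror if a fault occurs."""
    out = bytearray()
    for source in [main, *mirrors]:
        try:
            while len(out) < length:
                out.append(read_byte(source, address + len(out)))
            return bytes(out)
        except MemoryFault:
            continue  # keep the bytes already read, fetch the rest from the next memory
    raise MemoryFault("no non-faulty memory holds the requested data")
```

To model the other variant, in which the data is re-read from the beginning, the `out` buffer would be cleared in the `except` branch before moving to the next memory.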
Optionally, as an embodiment, the method further includes: detecting whether each memory in the N memories has memory faults or not every set period; here, the set period may be, for example, 5 seconds, 10 seconds, or the like.
If at least one faulty memory is detected and at least one of the faulty memories is a main memory, then for any faulty main memory, if at least one non-faulty mirror memory exists among the mirror memories corresponding to that faulty main memory, any one of those non-faulty mirror memories is switched to serve as the main memory. This ensures that the main memory directly accessed by the external designated device is always a non-faulty memory, so that data can be read from the main memory.
Feeding back third information to the operating system through the CPU; the third information indicates at least: memory identification of each fault memory and switching information for indicating switching between the main memory and the mirror memory; after receiving the third information, the operating system updates the recorded configuration information based on the third information, and updates the recorded memory mapping table based on the third information. For example, the operating system may mark, based on the third information, the memory in the configuration information that matches the memory identifier of each failed memory as an unavailable memory, and update the configuration of the main memory and the mirror memory in the configuration information.
If at least one faulty memory is detected and none of the faulty memories is a main memory, fourth information is fed back to the operating system through the CPU; the fourth information indicates at least: the memory identifier of each faulty memory. After receiving the fourth information, the operating system updates the recorded configuration information based on the fourth information and updates the recorded memory mapping table based on the fourth information. For how to update the recorded memory mapping table based on the fourth information, refer to the above steps for updating the recorded memory mapping table based on the third information, which are not repeated here.
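The handling of one periodic detection result can be sketched as follows. The `groups` layout (main memory identifier mapped to its mirror identifiers), the report fields and the `handle_scan` function are illustrative assumptions. The sketch also assumes that a faulty former main memory stays in its group as a mirror so that it can be restored later; the application does not state what role it takes after the switch.

```python
def handle_scan(groups, faulty_ids):
    """Apply one detection result to the main/mirror configuration.

    groups: dict mapping main memory id -> list of mirror memory ids
    faulty_ids: ids of the memories found faulty in this detection period
    Returns (new_groups, report). The report models the third information when
    it contains "switches" and the fourth information when it does not.
    """
    faulty = set(faulty_ids)
    new_groups, switches = {}, []
    for main_id, mirror_ids in groups.items():
        healthy_mirrors = [m for m in mirror_ids if m not in faulty]
        if main_id in faulty and healthy_mirrors:
            new_main = healthy_mirrors[0]  # any non-faulty mirror may be chosen
            # Assumption: the old main is kept in the group as a mirror.
            new_groups[new_main] = [m for m in mirror_ids if m != new_main] + [main_id]
            switches.append({"old_main": main_id, "new_main": new_main})
        else:
            new_groups[main_id] = list(mirror_ids)
    report = {"faulty_ids": sorted(faulty)}
    if switches:
        report["switches"] = switches
    return new_groups, report
```

A controller would call such a routine once per detection period (for example every 5 or 10 seconds) and forward the report to the operating system through the CPU only when at least one faulty memory was found.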
Optionally, as an embodiment, any main memory and its corresponding mirror memories belong to the same memory group. The method further includes: for any faulty memory, when the faulty memory is found to no longer have a memory fault, processing it based on any non-faulty memory in the memory group to which it belongs, so that the data stored in the recovered memory is consistent with the data stored in the non-faulty memory. In this way, once a faulty memory is found to have recovered, its data can be restored in time from any other non-faulty memory of its memory group; for example, the data stored in the non-faulty memory can be copied into the recovered memory.
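The data recovery step can be sketched as a plain copy from a healthy member of the same memory group. `MemoryModule` and `resync_recovered` are hypothetical names for a simplified software model; the application only requires that the recovered memory end up consistent with a non-faulty memory of its group.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryModule:
    mem_id: int
    faulty: bool = False
    cells: dict = field(default_factory=dict)  # address -> byte value


def resync_recovered(recovered, group):
    """Copy the contents of a healthy group member into a recovered memory.

    Returns True if the memory was brought back in sync, False if the group
    has no other non-faulty memory to copy from.
    """
    source = next(
        (m for m in group if m is not recovered and not m.faulty), None
    )
    if source is None:
        return False
    recovered.cells = dict(source.cells)  # copy, so later writes stay independent
    recovered.faulty = False  # mark usable only after the data is consistent
    return True
```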
Thus, the flow shown in fig. 2 is completed.
As can be seen from the flow shown in fig. 2, in this embodiment, among the N blocks of memory provided to the external designated device as extended memory, at least one is configured as a main memory and the rest are configured as mirror memories; any main memory corresponds to at least one mirror memory. A mirror memory serves as a backup of its main memory, and during a write the data to be written is synchronously written into each memory that has no memory fault. If the main memory holding the data to be read has a memory fault during a read, the data can be read from any non-faulty mirror memory corresponding to that main memory. This avoids device downtime caused by a main memory fault and effectively reduces the probability of device downtime caused by extended memory faults. Further, after the data writing is completed, this embodiment feeds back to the operating system of the external designated device the first message, which indicates the memory identifier of each non-faulty memory into which the data has been written. The operating system can then update the recorded memory mapping table based on the first message, so that it knows which memories hold the data and can determine, for a subsequent read request, the memory in which the data to be read is located.
In order to facilitate understanding of the specific implementation process of the memory failure processing, the following description is given by way of example through specific embodiments.
Hardware failure is a major cause of server downtime, and memory failure accounts for 50% of hardware failures overall. If server downtime caused by memory faults can be effectively addressed, server availability can be greatly improved. As shown in fig. 3, a CXL memory expansion controller (i.e., a Compute Express Link memory controller) may expand 2 memory banks (memory modules), both of which are used as main memory; in practical application, however, if a main memory fails, the server may go down. Therefore, as shown in fig. 4, the embodiment of the present application provides a CXL memory fault processing method that creates a mirror mode (i.e., mirror memory) for the CXL memory provided by the CXL memory expansion controller, so that the mirror memory serves as a backup of the main memory. When a memory fault occurs in the main memory, the corresponding data can be returned directly from the backup memory to the application program of the host, which avoids downtime caused by a memory slot or memory bank fault and thus greatly reduces server downtime caused by CXL memory faults.
In fig. 3 and 4, a CXL Type 3 device refers to one particular device type defined by the CXL (Compute Express Link) specification. The command interface is an interface used for communication with, and control of, a device or system. The event status register (Event Status Reg) is a hardware register used to record and report status information of various events occurring in a device or system. The mailbox is a hardware mailbox, which may be regarded as a communication mechanism for passing messages or data between different processors or devices.
This embodiment involves the following key technical points: CXL memory bank reliability, availability, and serviceability (RAS) handling, and the implementation of the mirror mode of the CXL memory expansion controller.
Technical point 1: CXL memory bank RAS handling flow:
In this embodiment, referring to fig. 5, the CXL memory fault RAS flow is as follows:
1) The application program of the host requests data from the CXL controller (i.e., the CXL memory expansion controller); in this process, the M2S Req message of the CXL.mem protocol can be used for data transmission;
2) The device (i.e., the CXL memory expansion controller) fetches the data from the main CXL memory, and the CXL controller detects whether a UCE (uncorrectable error) signal is present;
3) If the UCE signal is present, the device returns data with poison bits; in this process, the S2M DRS and S2M data of the CXL.mem protocol can be used for data transmission;
4) The host CPU detects poison (i.e., data with poison bits) in the received data and triggers an MCERR (Machine Check Error) exception, after which the server goes down.
Here, a poison bit is an indicator that the data was affected by hardware damage or a fault at some location in a circuit or memory, so that the value read or written may differ from the normally expected value. Data with poison bits generally refers to data affected by such a hardware fault.
In this embodiment, after detecting that the UCE signal exists in the memory, the CXL controller may determine that a memory failure exists in the main CXL memory, and then may directly return data from the backup memory bank of the main CXL memory to the application program of the host computer, so as to avoid downtime caused by the memory failure.
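The difference between returning poisoned data and serving the read from the backup can be sketched as below. This is a behavioral model only: `device_read`, `host_consume`, and `MachineCheckError` are hypothetical names, and real poison signaling happens in CXL.mem protocol messages rather than in a Python tuple.

```python
class MachineCheckError(Exception):
    """Stands in for the MCERR exception triggered by the host CPU on poisoned data."""


def device_read(main_data, main_uce, mirror_data=None, mirror_mode=False):
    """Return (data, poison) as the controller would hand it back to the host."""
    if not main_uce:
        return main_data, False          # normal read from the main CXL memory
    if mirror_mode and mirror_data is not None:
        return mirror_data, False        # UCE on main: serve from the backup, no poison
    return main_data, True               # no usable backup: data is returned poisoned


def host_consume(data, poison):
    if poison:
        raise MachineCheckError("poison detected in received data")
    return data


# Mirror mode: the host receives good data even though the main memory reported a UCE.
served = host_consume(*device_read(b"bad", True, mirror_data=b"good", mirror_mode=True))

# No mirror: the same fault reaches the host as poison and triggers the exception.
try:
    host_consume(*device_read(b"bad", True))
    host_crashed = False
except MachineCheckError:
    host_crashed = True
```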
In this embodiment, user space refers to the portion of an operating system in which applications and user-level processes run. User util refers to a set of utilities that the operating system provides to users. OS refers to the operating system. The CXL mailbox driver is a driver used in the Linux operating system to implement the mailbox function of the CXL protocol. The mailbox is a lightweight communication mechanism that allows commands and responses to be exchanged between CXL devices so that the devices can be managed and controlled. Host CPU refers to the CPU of the host. The host memory controller refers to the hardware component in the physical host responsible for managing and controlling memory. Core(s) refers to the number of compute cores on a processor chip. Host cache refers to a cache on the host (computer) for storing temporary data.
According to this embodiment, when a fatal memory error is detected and the CXL controller is running in mirror mode, data is returned directly from the backup memory bank (i.e., the mirror CXL memory) to the application program of the host, which avoids downtime caused by memory faults such as memory slot or memory bank faults. This embodiment can detect faults of the CXL expansion memory in real time and switch between the main CXL memory and the backup CXL memory in real time when the main CXL memory has a memory fault.
Technical point 2: functions the CXL memory controller needs to implement in mirror mode:
1) On a write, the CXL controller performs a double write to the attached memories (i.e., the main CXL memory and the corresponding mirror CXL memory); that is, the data is written to the main CXL memory and the mirror CXL memory at the same time;
2) On a read, when an uncorrectable error is detected in the main CXL memory bank, the controller attempts to read the correct data from the backup CXL memory bank and returns it to the host CPU;
3) Faults of the memory bank or memory slot are detected in real time; when an error of the main CXL memory or its memory slot is detected, the main memory and the backup memory are actively switched (even if no data read has been triggered at that moment), and a corresponding alarm is raised;
4) The conditions required for mirror mode are checked, such as backup memory fault detection, not-present detection (i.e., whether a memory bank is inserted into the corresponding slot), and memory group mismatch detection (i.e., whether the memory bank is the one corresponding to the memory expansion card), and corresponding alarms are raised.
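The condition checks in item 4) can be sketched as a small function that returns the alarms to raise. The `Slot` fields and the alarm strings are hypothetical; the application names the checks but does not define data structures or alarm formats.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    present: bool    # is a memory bank inserted in the slot?
    faulty: bool     # has a fault been detected on the memory bank?
    group_id: str    # memory group the memory bank is configured for


def mirror_mode_alarms(main: Slot, backup: Slot) -> list:
    """Return the alarms that prevent or degrade mirror mode for this pair."""
    alarms = []
    if not backup.present:
        alarms.append("backup not present")
    elif backup.faulty:
        alarms.append("backup faulty")
    if backup.present and backup.group_id != main.group_id:
        alarms.append("memory group mismatch")
    return alarms


healthy = mirror_mode_alarms(Slot(True, False, "g0"), Slot(True, False, "g0"))
missing = mirror_mode_alarms(Slot(True, False, "g0"), Slot(False, False, "g0"))
```

An empty list means the pair satisfies the mirror-mode conditions modeled here.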
The embodiment of the application aims to greatly reduce the downtime risk caused by faults of CXL memory slots and/or memory banks. When a read poison error or an uncorrectable memory error is detected, the controller switches to the backup memory bank to read the data and returns it to the application program. The embodiment also detects fault precursors of the main and backup memories in real time and switches between them in advance. Compared with backing up the whole system, backup at memory bank granularity manages the downtime risk caused by memory faults at a finer granularity. This embodiment is suited, for example, to scenarios that are insensitive to memory bank cost but place very high demands on data security, system stability, and high availability.
Thus, the description of the method provided in this embodiment is completed, and the following describes the device provided in this embodiment of the present application:
referring to fig. 6, fig. 6 is a schematic structural diagram of a memory expansion control device according to an embodiment of the present application.
As shown in fig. 6, the memory expansion control device 600 provides N blocks of memory as the expansion memory of the external designated device to which the memory expansion control device is connected, N being greater than 1; at least one of the N blocks of memory is used as a main memory, and the rest is used as a mirror memory; any main memory corresponds to at least one mirror memory; the memory expansion control device includes a memory controller 601;
The memory controller 601 is configured to: when a data writing event is detected, where the data writing event indicates a first main memory into which the data to be written by the data writing event is to be written and a first storage location of the data to be written in the first main memory, and the first main memory corresponds to at least one first mirror memory, synchronously write the data to be written, according to the first storage location, into each non-faulty memory among the first main memory and the first mirror memories corresponding to the first main memory;
After the data writing is completed, feeding back the first message to an operating system of the external designated equipment through a CPU of the external designated equipment; the first message includes at least: a data identifier of the data to be written, and a memory identifier of each non-faulty memory to which the data to be written has been written; after receiving the first message, the operating system updates the recorded memory mapping table based on the first message; the memory mapping table indicates the mapping relation between the data identification of the stored data in the memory resource and the memory identification corresponding to the stored memory and the position identification of the storage position of the stored memory;
When a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading data to be read by a data reading event from any second mirror image memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
As one embodiment, detecting a data write event includes:
when a data writing message sent by a CPU is received, determining that a data writing event is detected;
The data writing message carries at least: the data to be written, a memory identifier indicating the first main memory into which the data to be written is to be written, and a location identifier indicating the first storage location of the data to be written in the first main memory;
The first main memory and the first storage location are determined by the operating system based on the set memory management mechanism, the recorded configuration information, and the data to be written; the memory resources include at least the local memory of the external designated device and the N blocks of memory; the configuration information includes at least the memory identifiers of the memories in the memory resources, the available state of each memory, and the main memory and mirror memory configuration associated with the N blocks of memory.
As one embodiment, detecting a data read event includes:
when a data request message sent by a CPU is received, determining that a data reading event is detected;
The data request message at least carries: a memory identifier indicating a second main memory in which the data to be requested is located, and a location identifier indicating a second storage location of the data to be requested in the second main memory; the second main memory and the second storage location are determined by the operating system based on the recorded memory map and the data identification of the data to be requested.
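How the operating system might keep the memory mapping table and derive a data request message from it can be sketched as follows. `MemoryMap`, its field names, and the dictionary message formats are hypothetical; the application only states which identifiers the table relates to each other.

```python
class MemoryMap:
    """Toy model of the memory mapping table recorded by the operating system."""

    def __init__(self):
        self.table = {}   # data id -> main memory id, location, and memories holding the data

    def record_write(self, first_message, main_id, location):
        # Update the table once the controller reports which memories were written.
        self.table[first_message["data_id"]] = {
            "main": main_id,
            "location": location,
            "holders": list(first_message["written_to"]),
        }

    def read_request(self, data_id):
        # Build the data request message from the recorded mapping.
        entry = self.table[data_id]
        return {"memory_id": entry["main"], "location": entry["location"]}


os_map = MemoryMap()
os_map.record_write({"data_id": "d1", "written_to": ["main0", "mirror0"]}, "main0", 0x10)
request = os_map.read_request("d1")
```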
As an embodiment, the memory controller 601 is further configured to perform the following:
Detecting whether each memory in the N memories has memory faults or not every set period;
if at least one fault memory is detected and at least one fault main memory exists in each fault memory, switching any one of the non-fault mirror memories into a main memory when at least one non-fault mirror memory exists in each mirror memory corresponding to any fault main memory;
Feeding back third information to the operating system through the CPU; the third information indicates at least: memory identification of each fault memory and switching information for indicating switching between the main memory and the mirror memory; after receiving the third information, the operating system updates the recorded configuration information based on the third information and updates the recorded memory mapping table based on the third information;
if at least one fault memory is detected and the fault main memory does not exist in each fault memory, fourth information is fed back to the operating system through the CPU; the fourth information indicates at least: memory identification of each fault memory; after receiving the fourth information, the operating system updates the recorded configuration information based on the fourth information and updates the recorded memory mapping table based on the fourth information.
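The periodic detection and switching logic above can be sketched as below. The group dictionaries, the `run_detection` function, and the report format are hypothetical, and the sketch additionally assumes that the replaced main memory is kept in the group as a mirror, which the text does not state.

```python
def run_detection(groups, faulty_ids):
    """groups: list of {"main": id, "mirrors": [ids]}; faulty_ids: ids found faulty this period."""
    if not faulty_ids:
        return None                                   # nothing to report
    report = {"faulty": sorted(faulty_ids)}
    faulty_mains = [g for g in groups if g["main"] in faulty_ids]
    if not faulty_mains:
        report["kind"] = "fourth"                     # only mirror memories are faulty
        return report
    report["kind"] = "third"                          # at least one main memory is faulty
    report["switches"] = []
    for g in faulty_mains:
        healthy = [m for m in g["mirrors"] if m not in faulty_ids]
        if healthy:
            old_main, new_main = g["main"], healthy[0]
            # Assumption: the old main memory stays in the group as a mirror.
            g["mirrors"] = [m for m in g["mirrors"] if m != new_main] + [old_main]
            g["main"] = new_main
            report["switches"].append({"old_main": old_main, "new_main": new_main})
    return report


groups = [{"main": "m0", "mirrors": ["r0", "r1"]}]
third = run_detection(groups, {"m0"})    # main fails: r0 is promoted to main
fourth = run_detection(groups, {"r1"})   # later, only a mirror fails
```

The returned report plays the role of the third or fourth information that the operating system uses to update its configuration information and memory mapping table.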
As an embodiment, the memory controller 601 is further configured to perform the following:
In the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, re-reading the data to be read by the data reading event from any second mirror memory without the memory fault;
Or alternatively
In the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, reading unread data in the data to be read by the data reading event from any second mirror image memory without the fault of the memory continuously.
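The two alternatives for a fault that appears in the middle of a read can be compared with a small sketch. The chunked data model, the `fail_at` parameter, and the returned count of chunks read from the mirror are hypothetical and serve only to show the difference between re-reading everything and continuing with the unread part.

```python
def read_with_failover(main, mirror, fail_at=None, resume=True):
    """main/mirror: lists of data chunks; fail_at: index at which the main memory fails.

    Returns (data, number of chunks read from the mirror).
    """
    done = []
    for i, chunk in enumerate(main):
        if fail_at is not None and i == fail_at:
            if resume:
                # Alternative 2: keep what was read, fetch only the unread chunks.
                return done + mirror[i:], len(mirror) - i
            # Alternative 1: discard the partial result and re-read everything.
            return list(mirror), len(mirror)
        done.append(chunk)
    return done, 0


chunks = ["a", "b", "c", "d"]
resumed = read_with_failover(chunks, chunks, fail_at=2, resume=True)
reread = read_with_failover(chunks, chunks, fail_at=2, resume=False)
```

Both alternatives return the same data; continuing from the unread part reads fewer chunks from the mirror, while re-reading avoids relying on data fetched from a memory that has just failed.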
As an embodiment, the memory controller 601 is further configured to perform the following:
In the process of synchronously writing the data to be written into each non-fault memory, if any non-fault memory is found to have a memory fault, stopping the continuous writing of the data to be written.
As an embodiment, for any main memory, the main memory and each corresponding mirror memory belong to the same memory group;
the memory controller 601 is further configured to:
for any faulty memory, when the faulty memory is found to no longer have a memory fault, process the faulty memory based on any non-faulty memory in the memory group to which the faulty memory belongs, so that the data stored in the faulty memory is consistent with the data stored in the non-faulty memory.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, i.e., they may be located in one place or distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without creative effort.
Referring to fig. 7, a schematic hardware structure of an electronic device according to an exemplary embodiment of the application is shown. The electronic device may include a processor 701, a communication interface 702, a memory 703, and a communication bus 704. The processor 701, the communication interface 702, and the memory 703 perform communication with each other via the communication bus 704. Wherein the memory 703 has stored thereon a computer program; the processor 701 can execute the steps of the method described in the above embodiment by executing a program stored on the memory 703. The electronic device may further include other hardware according to the actual function of the electronic device, which will not be described in detail.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this application and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks, or will be operatively coupled to receive data from or transfer data to such devices, or both. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this application contains many specific implementation details, these should not be construed as limitations on the scope of any application or of what may be claimed, but rather as descriptions of features of specific embodiments of particular applications. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the application.

Claims (10)

1. A memory fault processing method, characterized in that the method is applied to a memory expansion control device, wherein the memory expansion control device provides N blocks of memory to serve as an expansion memory of an external designated device connected with the memory expansion control device, and N is larger than 1; at least one of the N blocks of memory is used as a main memory, and the rest is used as a mirror memory; any main memory corresponds to at least one mirror memory; the method comprises the following steps:
When a data writing event is detected, the data writing event indicates a first main memory to which data to be written by the data writing event is written and a first storage position of the data to be written in the first main memory, wherein the first main memory corresponds to at least one first mirror memory; for the first main memory and the non-fault memories in the first mirror memories corresponding to the first main memory, synchronously writing the data to be written into the non-fault memories according to the first storage position;
After the data writing is completed, feeding back a first message to an operating system of the external designated equipment through a Central Processing Unit (CPU) of the external designated equipment; the first message includes at least: a data identifier of the data to be written and a memory identifier of each non-fault memory in which the data to be written is written; after receiving the first message, the operating system updates a recorded memory mapping table based on the first message; the memory mapping table indicates the mapping relation between the data identification of the stored data in the memory resource and the memory identification corresponding to the stored memory and the position identification of the storage position of the stored memory;
When a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading the data to be read by the data reading event from any second mirror memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
2. The method of claim 1, wherein the detecting a data write event comprises:
when a data writing message sent by the CPU is received, determining that the data writing event is detected;
The data writing message carries at least: the data to be written, a memory identifier indicating the first main memory into which the data to be written is to be written, and a location identifier indicating the first storage location of the data to be written in the first main memory;
the first main memory and the first storage location are determined by the operating system based on a set memory management mechanism, recorded configuration information, and the data to be written; the memory resource at least comprises a local memory of the external designated equipment and the N blocks of memories; the configuration information at least comprises memory identifiers of memories in the memory resources, available states of each memory, and main memory and mirror memory configurations associated with the N blocks of memories.
3. The method of claim 1, wherein the detecting a data read event comprises:
when a data request message sent by the CPU is received, determining that the data reading event is detected;
The data request message at least carries: a memory identifier indicating a second main memory in which the data to be requested is located, and a location identifier indicating a second storage location of the data to be requested in the second main memory; the second main memory and the second storage location are determined by the operating system based on the recorded memory map and the data identification of the data to be requested.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
Detecting whether each memory in the N memories has a memory fault or not every set period;
if at least one fault memory is detected and at least one fault main memory exists in each fault memory, switching any one of the non-fault mirror memories into a main memory when at least one non-fault mirror memory exists in each mirror memory corresponding to any fault main memory;
Feeding third information back to the operating system through the CPU; the third information indicates at least: memory identification of each fault memory and switching information for indicating switching between the main memory and the mirror memory; after receiving the third information, the operating system updates the recorded configuration information based on the third information and updates the recorded memory mapping table based on the third information;
If at least one fault memory is detected and the fault main memory does not exist in each fault memory, fourth information is fed back to the operating system through the CPU; the fourth information indicates at least: memory identification of each fault memory; after receiving the fourth information, the operating system updates the recorded configuration information based on the fourth information, and updates the recorded memory mapping table based on the fourth information.
5. The method according to claim 1, characterized in that the method further comprises:
In the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, re-reading the data to be read by the data reading event from any second mirror memory without the memory fault;
Or alternatively
And in the process of reading the data to be read by the data reading event from the second main memory, if the second main memory is found to have a fault, continuing to read unread data in the data to be read by the data reading event from any second mirror memory which has no memory fault.
6. The method according to claim 1, characterized in that the method further comprises:
And in the process of synchronously writing the data to be written into each non-fault memory, if any non-fault memory is found to have a memory fault, stopping the continuous writing of the data to be written.
7. The method of claim 1, wherein for any main memory, the main memory and the corresponding mirror memories belong to the same memory group;
the method further comprises:
for any faulty memory, when the faulty memory is found to no longer have a memory fault, processing the faulty memory based on any non-faulty memory in the memory group to which the faulty memory belongs, so that the data stored in the faulty memory is consistent with the data stored in the non-faulty memory.
8. A memory expansion control device, characterized in that the memory expansion control device provides N blocks of memory as the expansion memory of an external designated device connected with the memory expansion control device, wherein N is larger than 1; at least one of the N blocks of memory is used as a main memory, and the rest is used as a mirror memory; any main memory corresponds to at least one mirror memory; the memory expansion control device comprises a memory controller;
The memory controller is configured to: when a data writing event is detected, where the data writing event indicates a first main memory into which the data to be written by the data writing event is to be written and a first storage location of the data to be written in the first main memory, and the first main memory corresponds to at least one first mirror memory, synchronously write the data to be written, according to the first storage location, into each non-faulty memory among the first main memory and the first mirror memories corresponding to the first main memory;
After the data writing is completed, feeding back a first message to an operating system of the external designated equipment through a Central Processing Unit (CPU) of the external designated equipment; the first message includes at least: a data identifier of the data to be written and a memory identifier of each non-fault memory in which the data to be written is written; after receiving the first message, the operating system updates a recorded memory mapping table based on the first message; the memory mapping table indicates the mapping relation between the data identification of the stored data in the memory resource and the memory identification corresponding to the stored memory and the position identification of the storage position of the stored memory;
When a data reading event is detected, the data reading event indicates a second main memory in which data to be read by the data reading event is located and a second storage position of the data to be read in the second main memory, wherein the second main memory corresponds to at least one second mirror memory; if the second main memory is found to have no memory fault, reading data to be read by the data reading event from the second main memory according to the second storage position; if the second main memory is found to have memory faults, reading the data to be read by the data reading event from any second mirror memory without memory faults according to the second storage position; after the data reading is completed, feeding back a second message to the CPU, wherein the second message at least comprises: data to be read.
9. An electronic device, comprising:
a processor; and
a memory in which computer program instructions are stored which, when executed by the processor, cause the processor to perform the steps of the method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that it has stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method according to any of claims 1 to 7.
CN202410326646.2A 2024-03-20 2024-03-20 Memory fault processing method, memory expansion control device, electronic device and medium Pending CN117950921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410326646.2A CN117950921A (en) 2024-03-20 2024-03-20 Memory fault processing method, memory expansion control device, electronic device and medium

Publications (1)

Publication Number Publication Date
CN117950921A (en) 2024-04-30

Family

ID=90805361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410326646.2A Pending CN117950921A (en) 2024-03-20 2024-03-20 Memory fault processing method, memory expansion control device, electronic device and medium

Country Status (1)

Country Link
CN (1) CN117950921A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination