CN118132350A - CXL memory fault tolerance method, server system, storage medium and electronic equipment - Google Patents

CXL memory fault tolerance method, server system, storage medium and electronic equipment

Publication number
CN118132350A
Authority
CN
China
Prior art keywords
cxl
memory
memory device
group
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410532219.XA
Other languages
Chinese (zh)
Inventor
谢志勇
董刚
赵健
李仁刚
张闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410532219.XA priority Critical patent/CN118132350A/en
Publication of CN118132350A publication Critical patent/CN118132350A/en
Pending legal-status Critical Current

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The embodiments of the application provide a CXL memory fault tolerance method, a server system, a storage medium and electronic equipment. The method comprises the following steps: acquiring parameter values of a set of operating parameters of CXL memory devices in a CXL memory device group, wherein the set of operating parameters represents the operating state of the corresponding CXL memory device; predicting the operating state of the CXL memory devices in the CXL memory device group according to the acquired parameter values; and, when it is predicted that an abnormally operating memory device exists in the CXL memory device group, controlling execution of a migration operation on the memory data in the abnormal memory device, so as to migrate that memory data to a target memory device in the CXL memory device group whose operating state is normal. The application addresses the problem in the related art that a hot standby memory occupies a server slot and thereby lowers the memory utilization of the server.

Description

CXL memory fault tolerance method, server system, storage medium and electronic equipment
Technical Field
The embodiment of the application relates to the field of computers, in particular to a CXL memory fault tolerance method, a server system, a storage medium and electronic equipment.
Background
Currently, to improve the performance and availability of cloud infrastructure, CXL (Compute Express Link) memory devices may be used to extend the memory capacity of the cloud infrastructure. When a CXL memory device fails, problems such as data loss may result, and in severe cases the system may crash and the server may go down. Therefore, to improve system availability, it is desirable to manage CXL memory and provide a fault tolerance scheme.
In the related art, a memory hot standby technology may be used to keep a hot standby of the memory, so as to perform fault-tolerant control on the memory. In the memory hot standby technology, the hot standby memory is not used under normal conditions; when the number of failures of the working memory reaches a preset condition, the system automatically transfers the data in the failed memory module to the hot standby memory module, and the failed memory module is no longer used.
However, the hot standby memory occupies the server slot, resulting in a reduction in the available memory space of the server. Therefore, the CXL memory fault tolerance method in the related art has the problem of low memory utilization rate of the server caused by the occupation of the server slot by the hot standby memory.
Disclosure of Invention
The embodiment of the application provides a CXL memory fault-tolerant method, a server system, a storage medium and electronic equipment, which at least solve the problem that the CXL memory fault-tolerant method in the related art has low memory utilization rate of a server caused by the fact that a hot standby memory occupies a server slot.
According to one embodiment of the present application, there is provided a CXL memory fault tolerance method applied to a server system, the server system including a CXL memory device group, a CXL switch group, and a CXL host group, the CXL memory device group and the CXL host group being connected to the CXL switch group by a CXL bus, the CXL switch group being configured to connect the CXL memory device group and the CXL host group; the method comprises the following steps: acquiring parameter values of a set of operation parameters of CXL memory devices in the CXL memory device group, wherein the CXL memory devices in the CXL memory device group are all in an operation state, and the set of operation parameters are used for representing the operation state of the corresponding CXL memory devices; predicting the operation state of CXL memory devices in the CXL memory device group according to the acquired parameter values of the group of operation parameters; and under the condition that the abnormal memory device with abnormal operation in the CXL memory device group is predicted, controlling to execute migration operation on the memory data in the abnormal memory device so as to migrate the memory data in the abnormal memory device to a target memory device with normal operation state in the CXL memory device group, wherein the abnormal memory device after data migration is removed from the CXL memory device group.
According to another embodiment of the present application, there is provided a server system including: a CXL memory device group, a CXL switch group, a CXL host group and a control device, wherein the CXL memory device group and the CXL host group are connected to the CXL switch group by a CXL bus, and the CXL switch group is configured to connect the CXL memory device group and the CXL host group. The control device is configured to: acquire parameter values of a set of operation parameters of CXL memory devices in the CXL memory device group, wherein the CXL memory devices in the CXL memory device group are all in an operation state, and the set of operation parameters is used for representing the operation state of the corresponding CXL memory device; predict the operation state of the CXL memory devices in the CXL memory device group according to the acquired parameter values of the set of operation parameters; and, when it is predicted that an abnormal memory device with abnormal operation exists in the CXL memory device group, control execution of a migration operation on the memory data in the abnormal memory device, so as to migrate the memory data in the abnormal memory device to a target memory device with a normal operation state in the CXL memory device group, wherein the abnormal memory device is removed from the CXL memory device group after the data migration.
According to yet another aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the steps of any of the method embodiments described above.
According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the present application, for a server system comprising a CXL memory device group, a CXL switch group and a CXL host group, parameter values of a set of operating parameters that represent the operating state of a CXL memory device can be acquired, and the operating state of the CXL memory device can be predicted based on the acquired parameter values. If the operating state of a CXL memory device is predicted to be abnormal, the memory data in that device can be migrated to a CXL memory device that is operating normally. Because the CXL memory device is still running (i.e., is not down) at the time of prediction, its memory data can be migrated out before it goes down, which reduces the impact of the device abnormality on the server system. In addition, because the target memory device is a working device in the CXL memory device group rather than a dedicated hot standby memory, this solves the problem in the related art that a hot standby memory occupies a server slot and lowers the memory utilization of the server, and achieves the technical effect of improving the memory utilization of the server.
Drawings
Fig. 1 is a hardware structural block diagram of a server device of a CXL memory fault tolerance method according to an embodiment of the present application.
Fig. 2 is a block diagram of an alternative server system according to an embodiment of the present application.
FIG. 3 is a flow chart of an alternative CXL memory fault tolerance method according to an embodiment of the application.
Fig. 4 is a block diagram of an alternative server system according to an embodiment of the present application.
FIG. 5 is a block diagram of an alternative CXL memory device according to an embodiment of the application.
Fig. 6 is a schematic diagram of an alternative CXL memory management information table according to an embodiment of the application.
FIG. 7 is a flow chart of an alternative CXL memory fault tolerance method according to an embodiment of the application.
Fig. 8 is a block diagram of still another alternative server system according to an embodiment of the present application.
FIG. 9 is a block diagram of an alternative computer system according to an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a server device or a similar computing device. Taking execution on a server device as an example, fig. 1 is a block diagram of a hardware structure of a server device for a CXL memory fault tolerance method according to an embodiment of the present application. As shown in fig. 1, the server device may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA)) and a memory 104 for storing data, and may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the architecture shown in fig. 1 is merely illustrative and is not intended to limit the architecture of the server device described above. For example, the server device may include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store computer programs, such as software programs and modules of application software, such as computer programs corresponding to the CXL memory fault tolerance method in the embodiments of the present application, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, implement the methods described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located with respect to the processor 102, which may be connected to the server device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a server device. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a CXL memory fault tolerance method is provided, which may be applied to a server system. As shown in fig. 2, the server system may include a CXL memory device group 202, a CXL switch group 204, and a CXL host group 206, where the CXL memory device group 202 and the CXL host group 206 are connected to the CXL switch group 204 by a CXL bus, and the CXL switch group 204 is used to connect the CXL memory device group 202 to the CXL host group 206. The CXL memory device group 202 may include multiple CXL memory devices, the CXL switch group 204 may include one or more CXL switches, and the CXL host group 206 may include one or more CXL hosts.
With the rapid development of the internet and cloud computing services, the demands on the performance and availability of cloud infrastructure keep growing. The new CXL memory devices can effectively expand the memory capacity of the cloud infrastructure and thereby improve performance. However, if the memory failure rate is high, failed memory will affect the operation of the system; in the most serious case it directly causes the system to crash and the server to go down, causing serious loss to the parties concerned.
In order to reduce the impact of CXL memory failures, a management and fault tolerance design for CXL memory is required to improve system availability. In the related art, memory fault tolerance technologies such as memory hot standby can be adopted for memory management. In the memory hot standby technology, the hot standby memory (which may be a memory module) is not used under normal conditions; when the number of failures of the working memory reaches a preset maximum number of ECC (Error Correcting Code) errors, the system automatically transfers the data in the failed memory to the hot standby memory, and the failed memory is no longer used. However, in the memory hot standby technology, the hot standby memory occupies a server slot, so that the available memory space of the server is reduced and the memory utilization of the server is lowered.
In addition, other CXL memory fault tolerance approaches can be adopted. For example, ECC can be used to correct 1- to 2-bit memory errors, and memory information can be obtained through a command built into the system or through the BMC (Baseboard Management Controller) in order to view an overview of each memory module. However, this approach only provides an overview of the memory information; it does not provide a fault tolerance design for the memory, and cannot avoid or reduce the loss caused by a memory failure. Other memory fault tolerance techniques, such as memory mirroring, may also be employed. In memory mirroring, two copies of the memory data are kept, which avoids data loss caused by memory faults; the working memory and the mirror memory are also placed in different channels, which avoids data loss caused by memory channel errors. However, although memory mirroring does not affect the available memory space of the server, it is costly.
In order to at least partially solve the above technical problems and avoid serious loss caused by a CXL memory failure, the CXL memory fault tolerance method provided in this embodiment provides a management and fault tolerance design for CXL memory. The method obtains parameter values of a set of operating parameters that indicate the operating state of a CXL memory device, and predicts the operating state of the CXL memory device based on the obtained parameter values. If the operating state of a CXL memory device is predicted to be abnormal, the memory data in that device can be migrated to a CXL memory device that is operating normally. Because the CXL memory device is still running at the time of prediction, its memory data can be migrated out before it goes down, which reduces the impact of the device abnormality on the server system.
Here, CXL is an interconnect technology standard. The CXL 3.0 standard removes the restriction that a given physical memory belongs to a single server, and enables, at the hardware level, multiple machines to jointly access the same memory device. According to the CXL 3.0 specification, a CXL memory device can be connected to a CXL host through a CXL switch, appearing to the CXL host as a logical device and exposing the memory capacity of the device. The system can change the topological connections of the CXL switch as required, and thereby determine the link relation between CXL hosts and CXL memory devices.
Alternatively, the CXL memory in this embodiment may be DDR5 (Double Data Rate 5 Synchronous Dynamic Random Access Memory) memory. Compared with earlier DDR memories, DDR5 memory adds an error correction mechanism at the level of the DRAM die (on-die ECC). During operation of the server, the CPU (Central Processing Unit) performs read and write operations on the memory. To prevent ECC errors from occurring in the memory during normal operation of the server, DDR5 memory may use the ECS (Error Check and Scrub) function to patrol idle areas that are not being read or written and to correct ECC errors in the DRAM dies.
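The link relation between hosts and memory devices described above can be pictured as a small binding table. The following Python sketch is illustrative only: the class `SwitchBindings` and its methods are hypothetical names, not part of any CXL software interface, and for simplicity it assumes each memory device is bound to at most one host at a time, which is narrower than the shared access that CXL 3.0 permits.

```python
class SwitchBindings:
    """Toy model of which host each CXL memory device is currently bound to."""

    def __init__(self):
        self._host_of = {}  # device name -> host name

    def bind(self, device, host):
        # Binding an already bound device simply moves it to the new host.
        self._host_of[device] = host

    def unbind(self, device):
        self._host_of.pop(device, None)

    def devices_of(self, host):
        return sorted(d for d, h in self._host_of.items() if h == host)


bindings = SwitchBindings()
bindings.bind("mem1", "host1")
bindings.bind("mem2", "host1")
bindings.bind("mem2", "host2")  # re-bind: mem2 now belongs to host2
```

Changing an entry in such a table stands in for reconfiguring the switch topology; a real fabric manager would issue the corresponding management commands instead.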
Fig. 3 is a flowchart of an alternative CXL memory fault tolerance method according to an embodiment of the application, as shown in fig. 3, the flowchart includes steps S302 to S306 as follows.
Step S302 obtains parameter values of a set of operating parameters of CXL memory devices in the CXL memory device group.
The aforementioned server system may contain a control device, which may be one of the CXL hosts in the CXL host group 206, or another device independent of the CXL memory device group 202, the CXL switch group 204, and the CXL host group 206, such as an FM (Fabric Manager). The FM is responsible for managing the devices on the CXL bus, including configuring the CXL switches to connect CXL devices on the bus to a particular host. The control device may obtain parameter values of a set of operating parameters of each CXL memory device in the CXL memory device group, i.e., a corresponding set of parameter values for each different CXL memory device. Each CXL memory device in the CXL memory device group is in an operating state, and a set of operating parameters indicates the operating state of the corresponding CXL memory device. The set may include, for example, one or more of an erase speed, an operating frequency, an average operating temperature, and an average operating voltage, and may further include other operating parameters, which is not limited in this embodiment.
Alternatively, the control device may communicate with all or part of the CXL memory device group 202, the CXL switch group 204, and the CXL host group 206 via the CXL bus or a designated bus, where the designated bus is a bus different from the CXL bus. Correspondingly, the control device may obtain parameter values of a corresponding set of operating parameters from the CXL memory devices in the CXL memory device group via the CXL bus or the designated bus. The designated bus may belong to the network used for interaction between the CXL memory device group 202, the CXL switch group 204, and the CXL host group 206, or may belong to a separate network, such as a management network. In the latter case, the control device may obtain the parameter values via the management network; e.g., the control device may obtain, from each CXL memory device in the CXL memory device group, the parameter values of the set of operating parameters of that device via the management network.
Alternatively, the parameter values of the set of operating parameters of the CXL memory devices may be acquired periodically or in response to an event. When they are acquired, the parameter values of all or part of the CXL memory devices in the CXL memory device group 202 may be obtained by polling or in other manners. This embodiment does not limit the manner or the timing of acquiring the parameter values of the set of operating parameters.
For example, for a server system as shown in fig. 4, the server system can include: host 1 to host N, CXL Fabric (comprising one or more CXL switches), CXL MEM (memory) 1 to CXL MEM N, and FM. The FM may obtain the operating parameter information (i.e., parameter values of a set of operating parameters) of the memory from the CXL memory device via the management network, including: the erasing speed v, the running frequency f, the running average temperature te and the average voltage vo.
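As a rough illustration of the acquisition step, the Python sketch below models one polling pass over the devices. The type `OperatingParams`, the function `collect_params`, and the stub `fake_reader` are hypothetical names introduced here; the source does not specify how the FM reads values over the management network, so the reader is passed in as a callable, and the numeric values are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class OperatingParams:
    erase_speed: float      # v
    frequency: float        # f
    avg_temperature: float  # te
    avg_voltage: float      # vo


def collect_params(
    device_ids: Iterable[str],
    read_device: Callable[[str], OperatingParams],
) -> Dict[str, OperatingParams]:
    """Poll every device once and return {device id: parameter values}."""
    return {dev: read_device(dev) for dev in device_ids}


def fake_reader(dev_id: str) -> OperatingParams:
    # Stand-in for a real management-network query; values are made up.
    return OperatingParams(erase_speed=1.0, frequency=4800.0,
                           avg_temperature=45.0, avg_voltage=1.1)


snapshot = collect_params(["CXL MEM 1", "CXL MEM 2"], fake_reader)
```

Keeping the transport behind `read_device` means the same pass works whether the values arrive over the CXL bus or over a separate management network.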
Step S304, according to the acquired parameter values of a group of operation parameters, the operation states of CXL memory devices in the CXL memory device group are predicted.
According to the obtained parameter values of a set of operation parameters, the control device may predict an operation state of the corresponding CXL memory device, where the prediction may be a predicted memory health state or a predicted memory lifetime. One way of predicting the operation state is to fuse the parameter values of the set of operation parameters to obtain a state reference value of the corresponding CXL memory device, and to predict the operation state of that device based on the state reference value. Another way is to input the obtained parameter values of the set of operation parameters into a pre-trained prediction model and take the prediction result output by the model. This embodiment does not limit the manner of prediction.
Alternatively, the parameter values fused in the fusing step may include the parameter values acquired at the current time and may also include historically acquired parameter values. Here, the parameter values acquired at the current time are the most recent parameter values used for the prediction; that is, if the parameter values acquired at a certain time and the parameter values acquired before that time are used for a prediction, the parameter values acquired at that time are the parameter values acquired at the current time. The fusion manner adopted in the fusing step may include, but is not limited to, at least one of convolution fusion and weighted fusion; other fusion manners are also possible.
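A minimal sketch of weighted fusion follows, under several assumptions that the source does not state: each parameter has already been normalized to a score in [0, 1] where higher means more stress, the weights are equal, the fused state reference value is the mean of the per-sample weighted sums, and a device is flagged when that value exceeds a threshold. The names `fuse_weighted`, `is_abnormal`, and `WEIGHTS` are hypothetical.

```python
from typing import Dict, List


def fuse_weighted(samples: List[Dict[str, float]],
                  weights: Dict[str, float]) -> float:
    """Fuse current and historical samples into one state reference value."""
    if not samples:
        raise ValueError("at least one sample is required")
    # Weighted sum per sample, then the mean over all samples.
    per_sample = [sum(weights[name] * s[name] for name in weights)
                  for s in samples]
    return sum(per_sample) / len(per_sample)


def is_abnormal(state_reference: float, threshold: float) -> bool:
    # Assumed convention: a higher reference value means a less healthy device.
    return state_reference > threshold


WEIGHTS = {"v": 0.25, "f": 0.25, "te": 0.25, "vo": 0.25}  # hypothetical weights
history = [
    {"v": 0.5, "f": 0.5, "te": 0.5, "vo": 0.5},  # historical sample
    {"v": 1.0, "f": 1.0, "te": 1.0, "vo": 1.0},  # current sample
]
score = fuse_weighted(history, WEIGHTS)
```

A pre-trained prediction model, the other option mentioned above, would replace `fuse_weighted` and `is_abnormal` with a single model call on the same inputs.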
In step S306, in the case where it is predicted that there is an abnormal memory device with an operation abnormality in the CXL memory device group, the migration operation is controlled to be performed on the memory data in the abnormal memory device.
The operating state prediction may be performed separately for each CXL memory device. If all CXL memory devices in the CXL memory device group are predicted to operate normally, the prediction result may be ignored, or the parameter values of the set of operating parameters acquired this time may be saved so that they can be used in subsequent operating state predictions. If it is predicted that a CXL memory device with abnormal operation, i.e., an abnormal memory device, exists in the CXL memory device group, the control device may control execution of a migration operation on the memory data in the abnormal memory device. After the migration operation, the memory data in the abnormal memory device has been migrated to a target memory device in the CXL memory device group whose operating state is normal. The migration operation may be executed by the control device directly controlling the abnormal memory device and the target memory device, or may be executed by the CXL host connected to the abnormal memory device (i.e., the target host), or may be executed by the control device and the target memory device in cooperation. In addition to migrating the memory data to the target memory device, the configuration information related to the memory data in the abnormal memory device may be modified synchronously; alternatively, the related configuration information may be deleted first and then modified when an access to that memory data triggers the modification. After the data migration is completed, the abnormal memory device may be removed from the CXL memory device group to avoid its affecting the operation of the server system.
Optionally, the CXL memory devices in the CXL memory device group 202 may be allocated to virtual machine systems running on the CXL hosts in the CXL host group 206, and a VMM (Virtual Machine Monitor) may also run on the CXL hosts in the CXL host group 206. For the target host, the virtual machine monitor running on it is the target virtual machine monitor, and the virtual machine system to which the abnormal memory device is allocated is the target virtual machine system. When an abnormal memory device is predicted, memory migration for the target virtual machine system may be controlled. This memory migration includes controlling execution of the migration operation on the memory data in the abnormal memory device, and may further include other information modification operations, for example, modifying the configuration information related to the memory data in the abnormal memory device.
Through the above steps, parameter values of a set of operating parameters of the CXL memory devices in the CXL memory device group are obtained, wherein the CXL memory devices in the CXL memory device group are all in an operating state and the set of operating parameters represents the operating state of the corresponding CXL memory device; the operating state of the CXL memory devices in the CXL memory device group is predicted according to the acquired parameter values; and, when it is predicted that an abnormal memory device with abnormal operation exists in the CXL memory device group, execution of a migration operation on the memory data in the abnormal memory device is controlled, so as to migrate that memory data to a target memory device in the CXL memory device group whose operating state is normal, wherein the abnormal memory device is removed from the CXL memory device group after the data migration. This solves the problem in the related art that a hot standby memory occupies a server slot and lowers the memory utilization of the server, and improves the memory utilization of the server.
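The migrate-then-remove sequence can be sketched with an in-memory model. Everything here is hypothetical: `MemDevice` and `migrate_and_remove` are illustrative names, pages are assumed to have group-wide unique identifiers, and the target is chosen as the healthy device with the most free capacity, a policy the source does not prescribe (it only requires a target whose operating state is normal).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemDevice:
    name: str
    capacity: int                      # capacity in pages
    pages: Dict[int, bytes] = field(default_factory=dict)
    healthy: bool = True

    @property
    def free(self) -> int:
        return self.capacity - len(self.pages)


def migrate_and_remove(group: Dict[str, MemDevice], abnormal_name: str) -> str:
    """Move all pages off the abnormal device, then drop it from the group."""
    abnormal = group[abnormal_name]
    candidates = [d for d in group.values()
                  if d.healthy and d.name != abnormal_name
                  and d.free >= len(abnormal.pages)]
    if not candidates:
        raise RuntimeError("no healthy target with enough free capacity")
    target = max(candidates, key=lambda d: d.free)  # assumed selection policy
    target.pages.update(abnormal.pages)
    abnormal.pages.clear()
    del group[abnormal_name]  # the abnormal device leaves the group
    return target.name


group = {
    "mem1": MemDevice("mem1", 4, {1: b"a", 2: b"b"}, healthy=False),
    "mem2": MemDevice("mem2", 4, {3: b"c"}),
    "mem3": MemDevice("mem3", 8),
}
chosen = migrate_and_remove(group, "mem1")
```

Because the source device is still running when the prediction fires, the copy happens from live memory rather than from a backup, which is what makes a dedicated hot standby unnecessary in this scheme.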
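The configuration update that accompanies a migration can be illustrated with a toy mapping table. This is a sketch under assumptions: the table layout (guest page number to device name and device page id) and the function `remap_pages` are hypothetical and only show the synchronous variant, in which mappings are rewritten immediately after the data has moved.

```python
from typing import Dict, Set, Tuple


def remap_pages(page_table: Dict[int, Tuple[str, int]],
                moved: Set[int],
                old_device: str,
                new_device: str) -> int:
    """Point every mapping of a migrated page at new_device.

    Returns the number of entries rewritten.
    """
    changed = 0
    for guest_page, (device, page_id) in list(page_table.items()):
        if device == old_device and page_id in moved:
            page_table[guest_page] = (new_device, page_id)
            changed += 1
    return changed


table = {0: ("mem1", 10), 1: ("mem1", 11), 2: ("mem2", 10)}
rewritten = remap_pages(table, {10, 11}, "mem1", "mem3")
```

The lazy variant mentioned in the text would instead delete the affected entries and rebuild each one the first time the corresponding page is accessed.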
In some exemplary embodiments, obtaining parameter values of a set of operating parameters of CXL memory devices in a CXL memory device group includes: periodically executing, by the control device via a designated bus, a parameter value acquisition operation for the set of operating parameters on the CXL memory devices in the CXL memory device group, to obtain the acquired parameter values of the set of operating parameters.
In this embodiment, the server system further includes a control device. The control device may perform a parameter value acquisition operation for a set of operating parameters on the CXL memory devices in the CXL memory device group to obtain the acquired parameter values, which include the parameter values of the set of operating parameters corresponding to each CXL memory device. Alternatively, the control device may be connected to the CXL memory device group, the CXL switch group, and the CXL host group, respectively, via a designated bus that is different from the CXL bus; for example, the designated bus may be a management network bus. Correspondingly, the parameter value acquisition operation may be performed via the designated bus, i.e., the control device may perform the parameter value acquisition operation on the CXL memory devices in the CXL memory device group via the designated bus. Alternatively, the parameter value acquisition operation may be performed periodically, i.e., the control device may periodically perform the parameter value acquisition operation on the CXL memory devices in the CXL memory device group.
For example, for the server system shown in fig. 4, the FM periodically performs a state detection step to monitor the CXL memory state in real time, the state detection step including: acquiring the working parameter information of the memory from the CXL memory devices through the management network.
According to this embodiment, the control device obtains the parameter values of the operating parameters of the CXL memory devices through the designated bus, so the CXL bus used for interaction between the CXL devices does not need to be occupied, and the efficiency and success rate of information acquisition can be improved.
In some exemplary embodiments, the parameter values used to predict the operating states of the CXL memory devices in the CXL memory device group include, in addition to the currently acquired parameter values, parameter values acquired before the current time. Correspondingly, predicting the operating state of the CXL memory devices in the CXL memory device group according to the acquired parameter values of the set of operating parameters includes: predicting the operating states of the CXL memory devices in the CXL memory device group according to the parameter values of the set of operating parameters acquired in a plurality of consecutive periods.
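Periodic acquisition can be sketched as a fixed-period loop. The function `run_collection` and its parameters are hypothetical; the read over the designated bus is abstracted as the callable `read_all`, and the sleep function is injectable so that the loop can be exercised without waiting.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_collection(
    read_all: Callable[[], Dict[str, float]],
    cycles: int,
    period_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, Dict[str, float]]]:
    """Collect one sample per period; returns [(period index t, sample)]."""
    samples = []
    for t in range(1, cycles + 1):
        samples.append((t, read_all()))
        if t < cycles:
            sleep(period_s)  # wait for the next acquisition period
    return samples


# Placeholder reader returning a made-up temperature for one device.
demo = run_collection(lambda: {"mem1": 45.0}, cycles=3, period_s=0.0)
```

A production fabric manager would more likely run this indefinitely from a scheduler or timer; a bounded loop is used here only to keep the sketch finite.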
Here, the parameter values of the set of operation parameters acquired in the consecutive plurality of periods may include parameter values of the set of operation parameters acquired in the t-th period and parameter values of the set of operation parameters acquired in L acquisition periods before the t-th period, L being a positive integer greater than or equal to 1, and t being a positive integer greater than L.
According to the method and the device, the operation state of the CXL memory device is predicted based on the parameter values of the operation parameters of the CXL memory device obtained for a plurality of times, so that the accuracy of the operation state prediction of the CXL memory device can be improved.
In some exemplary embodiments, predicting an operating state of a CXL memory device in a group of CXL memory devices based on parameter values for a set of operating parameters collected over a continuous plurality of cycles, comprises: and predicting the current operating state of CXL memory devices in the CXL memory device group according to the parameter values of the group of operating parameters acquired in the t-th period, the parameter values of the group of operating parameters acquired in the t-1 th period and the parameter values of the group of operating parameters acquired in the t-2 th period.
In this embodiment, L = 2, i.e., the operating states of the CXL memory devices in the CXL memory device group are predicted from the parameter values of the set of operating parameters acquired in the t-th period and the parameter values acquired in the two periods preceding the t-th period (i.e., the (t-1)-th period and the (t-2)-th period). The predicted operating state may be the operating state of a CXL memory device in the CXL memory device group within the t-th period.
Optionally, in this embodiment, the set of operating parameters includes an erasing speed v, a runtime frequency f, a runtime average temperature te, and a runtime average voltage vo. The parameter values of the set of operating parameters collected in the n-th period are $v_n$, $f_n$, $te_n$, and $vo_n$, respectively, where n is a positive integer greater than or equal to 1; for example, n may be t, t-1, or t-2.
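The windowing over L + 1 consecutive periods described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `ParameterHistory`, the device identifier strings, and the tuple layout `(v, f, te, vo)` are hypothetical and do not come from the application.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# One collected sample per period: (v, f, te, vo). The layout is an assumption.
Sample = Tuple[float, float, float, float]


class ParameterHistory:
    """Keeps the most recent L + 1 samples per device (hypothetical helper)."""

    def __init__(self, lookback: int = 2) -> None:
        # lookback corresponds to L; the window holds periods t-L .. t.
        self.window = lookback + 1
        self._samples: Dict[str, Deque[Sample]] = {}

    def record(self, device_id: str, sample: Sample) -> None:
        # deque(maxlen=...) silently drops the oldest period when full.
        buf = self._samples.setdefault(device_id, deque(maxlen=self.window))
        buf.append(sample)

    def last_periods(self, device_id: str) -> Optional[Tuple[Sample, ...]]:
        """Return samples ordered oldest to newest, or None if t <= L."""
        buf = self._samples.get(device_id)
        if buf is None or len(buf) < self.window:
            return None
        return tuple(buf)
```

With `lookback=2`, `last_periods` returns nothing for the first two periods and a three-period window afterwards, which matches the requirement that t be greater than L.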
According to the method and the device, the operation state of the CXL memory device is predicted based on the parameter values of the operation parameters of the CXL memory device acquired in three continuous periods, so that the accuracy and convenience of the operation state prediction of the CXL memory device can be improved.
In some exemplary embodiments, when predicting the operating states of the CXL memory devices in the CXL memory device group, the parameter values of the set of operating parameters collected in a plurality of consecutive periods may be fused to obtain a parameter fusion value; a state reference value $H_t$ of the CXL memory devices in the CXL memory device group is determined according to the parameter fusion value, the parameter values of at least some of the operating parameters collected in the t-th period, and the row error unit number reference value; and the operating states of the CXL memory devices in the CXL memory device group are predicted based on the state reference value of each CXL memory device and the maximum row error unit number of that CXL memory device in the t-th period. Here, the operating state of each CXL memory device can be predicted in the foregoing manner, using the data that corresponds to that CXL memory device.
The reference value of the number of row error units is a value obtained by dividing a difference value between the maximum number of row error units of the corresponding CXL memory device in the t-th period and a configuration average value of the number of row error units of the corresponding CXL memory device by a configuration standard deviation of the number of row error units of the corresponding CXL memory device, and the maximum number of row error units is the number of error units of the row with the maximum error of the corresponding CXL memory device. Here, the configuration average value may be an average value of the number of row error units of the corresponding CXL memory device configured in advance, which may be obtained from the corresponding CXL memory device, for example, extracted from configuration information obtained from the corresponding CXL memory device, or may be obtained from another device.
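The row error unit number reference value defined above is a standard score: the difference from the configured average divided by the configured standard deviation. A minimal sketch follows; the function and argument names are illustrative, and the guard against a non-positive standard deviation is an added assumption.

```python
def row_error_reference(max_row_errors: float,
                        config_mean: float,
                        config_std: float) -> float:
    """(maximum row error units - configured average) / configured std dev."""
    if config_std <= 0:
        # Assumption: a non-positive configured deviation is a config error.
        raise ValueError("configured standard deviation must be positive")
    return (max_row_errors - config_mean) / config_std
```

For instance, with a configured average of 8 and a configured standard deviation of 3, a period whose worst row shows 14 error units yields a reference value of 2.0.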
The parameter values of the set of operating parameters acquired over consecutive periods may be fused by weighted fusion of the parameter values. Optionally, in order to improve the rationality of the information fusion, a target data matrix $D_t$ may be constructed from the parameter values of the set of operating parameters acquired in a plurality of consecutive periods, and the target data matrix $D_t$ is convolved with a preset convolution kernel $I$ to obtain a target convolution value $G_t$, where the target convolution value $G_t$ is the parameter fusion value described above.
Optionally, in this embodiment, the target data matrix $D_t$ is formed from the feature vectors of the consecutive periods, and the feature vector $S_n$ of the n-th period corresponds to the set of operating parameters (the defining formulas are not reproduced in this text). Here, V is the configuration erasing speed, F is the configuration operation maximum frequency, TE is the configuration operation average temperature, and VO is the configuration operation average voltage. The preset convolution kernel is a convolution kernel configured based on the impact factors corresponding to the operating parameters in the set of operating parameters, where the impact factor corresponding to an operating parameter represents the degree of association between that operating parameter and a memory failure of the corresponding memory device.
Optionally, the configuration parameter value of each operating parameter of each CXL memory device (e.g., the configuration erasing speed V, the configuration operation maximum frequency F, the configuration operation average temperature TE, and the configuration operation average voltage VO) may be a preconfigured parameter value of that operating parameter for the corresponding CXL memory device. It may be obtained from the corresponding CXL memory device, for example from the configuration information obtained from each CXL memory device, or it may be obtained from another device.
For example, before executing the state detection step, the FM reads the configuration information of the CXL memory device via the management network, including the configuration erasing speed V of the memory, the configuration operation maximum frequency F of the memory, the configuration operation average temperature TE of the memory, and the configuration operation average voltage VO of the memory; it also obtains the configured average and the configured standard deviation S of the row error unit number of the CXL memory device.
Here, the CXL memory device carries an SPD (Serial Presence Detect). The configured average and the configured standard deviation S of the row error unit number of the CXL memory device may be set according to testing and stored in advance in the SPD of the CXL memory device as configuration information. The SPD is an EEPROM (Electrically Erasable Programmable Read-Only Memory) on the memory module in which important information about the memory is recorded, such as the chip and module manufacturer of the memory, the operating frequency, the operating voltage, the speed, the capacity, and the row and column address bandwidth parameters. In addition, the CXL memory device can also store the used time of the device in the SPD.
Referring to the schematic structure of a CXL memory device shown in fig. 5, the CXL memory device can include: a CXL port, an address mapping unit, an access permission unit, a memory control unit, DRAM (Dynamic Random Access Memory), an SPD, a monitoring and alarm unit, a communication unit, and a memory management unit.
When the state detection step is executed, the FM periodically acquires the working parameter information of the memory from the CXL memory device through the management network. The working parameter information comprises the erasing speed v, the runtime frequency f, the runtime average temperature te, and the runtime average voltage vo; $v_t$, $f_t$, $te_t$, and $vo_t$ respectively represent the erasing speed, the runtime frequency, the runtime average temperature, and the runtime average voltage acquired in the t-th period. Let $S_t$ denote the feature vector of the t-th period, and construct the data matrix $D_t$ from $S_t$, $S_{t-1}$, and $S_{t-2}$ for t > 2; if t ≤ 2, the working parameter information of the next period continues to be acquired.
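The construction of the 4-row, 3-column data matrix from three consecutive periods can be sketched as below. Two points are assumptions rather than statements of the application, because the defining formulas are not reproduced in this text: each feature is taken as the ratio of the collected value to its configured value, and the columns are ordered chronologically (t-2, t-1, t).

```python
from typing import List, Optional, Sequence


def feature_vector(sample: Sequence[float],
                   config: Sequence[float]) -> List[float]:
    # ASSUMPTION: scale each collected value by its configured value
    # (v/V, f/F, te/TE, vo/VO). The source formula is not available.
    return [x / c for x, c in zip(sample, config)]


def build_data_matrix(history: Sequence[Sequence[float]],
                      config: Sequence[float]) -> Optional[List[List[float]]]:
    """history holds (v, f, te, vo) samples, oldest first.

    Returns a 4x3 matrix whose columns are the feature vectors of periods
    t-2, t-1 and t, or None while fewer than three periods exist (t <= 2).
    """
    if len(history) < 3:
        return None  # keep collecting the next period
    columns = [feature_vector(s, config) for s in history[-3:]]
    # Transpose: row i holds parameter i across the three periods.
    return [[col[i] for col in columns] for i in range(4)]
```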
Convolution is performed on the data matrix $D_t$, as shown in formula (1):
$$G_t = D_t * I \tag{1}$$
where $I$ is the convolution kernel.
The convolution kernel $I$ is a matrix whose entries are set based on the impact factors of the erasing speed, the running frequency, the running average temperature, and the running average voltage on memory faults (the kernel matrix itself is not reproduced in this text).
In determining the state reference value $H_t$ of the CXL memory devices in the CXL memory device group, the parameter fusion value (i.e., the target convolution value), the parameter values of at least some of the operating parameters collected in the t-th period, and the row error unit number reference value may be directly weighted and fused. The at least some operating parameters may be one or more operating parameters from the set of operating parameters and may include, for example, the runtime average temperature and the runtime average voltage.
Optionally, the state reference value $H_t$ of the CXL memory devices in the CXL memory device group can be calculated by equation (2), which is not reproduced in this text.
In equation (2), μ, δ, ε, and ρ are the set weighting coefficients; $E_t$ is the maximum row error unit number of the corresponding CXL memory device in the t-th period; the configuration average value and the configuration standard deviation S are those of the row error unit number of the corresponding CXL memory device; and the maximum row error unit number is the number of error units in the row with the most errors of the corresponding CXL memory device. Here, the calculated state reference value $H_t$ of the CXL memory device may serve as a health indicator of the CXL memory.
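Since the exact form of equation (2) is not available in this text, the following is only a sketch of one plausible weighted fusion. It assumes a plain linear combination in which μ weights the convolution value, δ the runtime average temperature, ε the runtime average voltage, and ρ the row error unit number reference value; the pairing of weights to terms and the use of unscaled temperature and voltage are assumptions.

```python
def state_reference(g_t: float, te_t: float, vo_t: float,
                    e_t: float, err_mean: float, err_std: float,
                    mu: float, delta: float,
                    epsilon: float, rho: float) -> float:
    """Hypothetical linear fusion standing in for equation (2)."""
    # Row error unit number reference value: (E_t - average) / S.
    row_error_ref = (e_t - err_mean) / err_std
    # ASSUMPTION: the weights apply to the terms in this order.
    return mu * g_t + delta * te_t + epsilon * vo_t + rho * row_error_ref
```

With illustrative inputs `g_t=3.0`, `te_t=1.0`, `vo_t=1.0`, `e_t=14`, average 8, deviation 3, and weights 1.0, 0.5, 0.5, 2.0, this sketch evaluates to 8.0.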
For example, the FM reads, through the management network, the number of error units of the row with the most errors in the memory chips, as detected by the ECS (Error Check and Scrub) function of the CXL memory device. Once the ECS record of the CXL memory device has been read, the CXL memory device clears the record and starts counting for the next period.
According to this embodiment, the parameter values of the operating parameters of the CXL memory device collected in a plurality of consecutive periods are vectorized to obtain the data matrix to be processed; the data matrix is convolved based on the impact factors of the different operating parameters; and the state reference value of the CXL memory device is determined from the resulting convolution value, the parameter values of at least some of the operating parameters, and the row error unit number reference value. The state reference value determined in this way can represent the operating state of the CXL memory device more accurately, which improves the accuracy of the operating state prediction. In addition, the maximum row error unit number is also taken into account when predicting the current operating state of the CXL memory device, which can further improve the prediction accuracy.
In some exemplary embodiments, the server system further comprises a control device connected to the CXL memory device group, the CXL switch group, and the CXL host group, respectively, via a designated bus that is different from the CXL bus. This is similar to the previous embodiments, has already been described, and is not repeated here.
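The read-then-clear behaviour of the per-period error record can be modelled with a small counter. The class below is a toy stand-in for the device-side record, not the ECS interface itself; all names are hypothetical.

```python
class RowErrorRecord:
    """Toy model of a per-period maximum row error record (hypothetical)."""

    def __init__(self) -> None:
        self._max_row_errors = 0

    def report_row(self, error_units: int) -> None:
        # Keep only the worst row seen in the current period.
        self._max_row_errors = max(self._max_row_errors, error_units)

    def read_and_clear(self) -> int:
        # Reading returns the period maximum and restarts the count.
        value = self._max_row_errors
        self._max_row_errors = 0
        return value
```

Clearing on read means each value the FM obtains covers exactly one collection period, so successive readings are not cumulative.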
In this embodiment, before the operation states of the CXL memory devices in the CXL memory device group are predicted, the configuration information of the CXL memory devices in the CXL memory device group may be acquired from the CXL memory devices in the CXL memory device group by the control device. Configuration information of CXL memory devices in the CXL memory device group is included in SPD data of the CXL memory devices in the CXL memory device group.
The configuration information of the CXL memory devices in the CXL memory device group may include configuration information indicating at least one of: the configuration erasing speed V, the configuration operation maximum frequency F, the configuration operation average temperature TE, the configuration operation average voltage VO, the configuration average value of the row error unit number, and the configuration standard deviation S of the row error unit number.
In this embodiment, the control device may acquire the configuration information of the CXL memory devices in the CXL memory device group via the designated bus described above. The configuration information may be acquired when the control device starts up or at any time after the control device starts up, for example, in response to an information acquisition instruction, in response to detection of a newly added CXL memory device, or in response to arrival of a specified acquisition time.
According to the embodiment, the configuration parameter value of the operation parameter is indicated through the configuration information of the CXL memory device, so that the convenience and controllability of the operation state prediction of the CXL memory device can be improved.
In some example embodiments, the parameter values of the set of operating parameters of the CXL memory devices may be obtained by traversing a CXL device information list. The CXL device information list records the device information of the CXL devices in the server system, whose device types may include CXL memory device, CXL switch, and CXL host; the parameter values of the set of operating parameters are obtained for each CXL memory device encountered during the traversal.
Here, the CXL device information list may be preconfigured, that is, the CXL device information list may be read from a configuration file of the server system, or the device information of the CXL device in the server system may be acquired by interacting with the CXL device in the server system, and the CXL device information list may be created based on the acquired device information. Correspondingly, before the control device obtains the configuration information of the CXL memory devices in the CXL memory device group from the CXL memory devices in the CXL memory device group, the method further comprises: acquiring device information from CXL devices in a server system through a designated bus by a control device to obtain the device information of the CXL devices in the server system; and constructing a CXL device information list based on the acquired device information of the CXL devices in the server system.
The device type of the CXL device in the server system comprises CXL memory devices, CXL switches and CXL hosts, and the device type of one CXL device in the server system is one of the CXL memory devices, the CXL switches and the CXL hosts. The set-up CXL device information list contains the device information for the CXL memory devices in the CXL memory device group, the device information for the CXL switch in the CXL switch group, and the device information for the CXL host in the CXL host group.
For example, the FM obtains information about devices such as CXL memory devices, CXL hosts, and CXL switches via the management network using MCTP (Management Component Transport Protocol) or another protocol, and builds a CXL device information list. Each entry may include the type of the device (one of four classes: host, memory, switch, or other), the vendor ID (identifier), the product ID, and other necessary device information. In addition, information such as the configuration erasing speed, the configuration operation maximum frequency, the configuration operation average temperature, and the configuration operation average voltage of the CXL memory device described above may be stored in the device information of the CXL device information list.
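A CXL device information list with the fields named above (device type, vendor ID, product ID, further device information) could be represented as follows. The class names, field names, and the `config` dictionary keys are illustrative assumptions; the way entries are obtained over MCTP is not modelled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class DeviceType(Enum):
    HOST = "host"
    MEMORY = "memory"
    SWITCH = "switch"
    OTHER = "other"


@dataclass
class CxlDeviceInfo:
    device_id: str            # hypothetical local identifier
    device_type: DeviceType
    vendor_id: int
    product_id: int
    # Optional configured values such as V, F, TE, VO for memory devices.
    config: Dict[str, float] = field(default_factory=dict)


def memory_devices(device_list: List[CxlDeviceInfo]) -> List[CxlDeviceInfo]:
    """Traverse the list and keep only the CXL memory devices."""
    return [d for d in device_list if d.device_type is DeviceType.MEMORY]
```

Traversing the full list and filtering by type is what lets the prediction cover every memory device known to the control device.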
According to the embodiment, by acquiring the device information from the CXL device in the server system and constructing the CXL device information list based on the acquired device information, and by traversing the CXL device information list to predict the operation state of the CXL memory device, the comprehensiveness of the operation state prediction of the CXL memory device can be improved (that is, the operation states of all CXL memory devices in the server system can be predicted).
In some exemplary embodiments, $S_t$, $S_{t-1}$, and $S_{t-2}$ are each column vectors with 4 rows and 1 column, and the target data matrix is a data matrix with 4 rows and 3 columns obtained by combining $S_t$, $S_{t-1}$, and $S_{t-2}$ in the chronological order of the periods. Correspondingly, before the target data matrix $D_t$ is convolved with the preset convolution kernel $I$ to obtain the target convolution value $G_t$, the method further includes: constructing the preset convolution kernel $I$.
Here, the preset convolution kernel is a matrix with 4 rows and 3 columns, and each row of the preset convolution kernel is matched with one operating parameter in the set of operating parameters. In each row of the preset convolution kernel $I$, the first element is a first coefficient multiplied by the impact factor corresponding to the matched operating parameter, the second element is 0, and the third element is a second coefficient multiplied by the impact factor corresponding to the matched operating parameter. The second coefficient is a positive integer greater than or equal to 2 (for example, the second coefficient = 2), and the first coefficient is the negative of the second coefficient (for example, the first coefficient = -2).
For example, in the convolution kernel $I$ described above, the four coefficients (whose symbols are not reproduced in this text) are the impact factors of the erasing speed, the running frequency, the running average temperature, and the running average voltage on memory faults, respectively.
According to this embodiment, the data matrix is built by combining the parameter values of the operating parameters of the CXL memory device collected in three periods, and the preset convolution kernel is constructed, so that the accuracy of the operating state prediction can be ensured while the amount of data to be processed is reduced.
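The kernel layout described above (first coefficient times impact factor, zero, second coefficient times impact factor) can be sketched as below, together with one way of reducing the 4x3 data matrix to a single fused value. Treating the convolution of two equally sized matrices as an element-wise multiply-and-sum without kernel flipping is an assumption, as are the function names and the example impact factors.

```python
from typing import List, Sequence


def build_kernel(impact_factors: Sequence[float],
                 second_coeff: int = 2) -> List[List[float]]:
    """One row per operating parameter: [-c * k, 0, c * k]."""
    if second_coeff < 2:
        raise ValueError("second coefficient must be an integer >= 2")
    first_coeff = -second_coeff  # the first coefficient is the negative
    return [[first_coeff * k, 0.0, second_coeff * k] for k in impact_factors]


def fuse(data_matrix: Sequence[Sequence[float]],
         kernel: Sequence[Sequence[float]]) -> float:
    # ASSUMPTION: same-size matrices are combined by multiplying matching
    # elements and summing, which yields a single convolution value.
    return sum(d * w
               for d_row, k_row in zip(data_matrix, kernel)
               for d, w in zip(d_row, k_row))
```

With this layout each row contributes `c * k * (newest - oldest)`, so the fused value responds to the trend of each parameter across the three periods and ignores the middle period.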
In some example embodiments, the current operating states of the CXL memory devices in the CXL memory device group may be predicted using preset thresholds, which may include a first specified threshold corresponding to the state reference value of the CXL memory devices in the CXL memory device group and a second specified threshold corresponding to the maximum row error unit number of the CXL memory devices in the t-th period. Correspondingly, predicting the current operating state of the CXL memory devices in the CXL memory device group based on the state reference value $H_t$ of the CXL memory devices and the maximum row error unit number $E_t$ in the t-th period includes: determining that the corresponding CXL memory device is operating abnormally when the state reference value $H_t$ of the CXL memory device is greater than or equal to the first specified threshold; and determining that the corresponding CXL memory device is operating abnormally when the maximum row error unit number $E_t$ of the CXL memory device in the t-th period is greater than or equal to the second specified threshold.
Here, it is determined that the CXL memory device is operating abnormally as long as any one of the state reference value of the CXL memory device being greater than or equal to the first specified threshold value and the maximum number of row error units of the CXL memory device at the t-th cycle being greater than or equal to the second specified threshold value is satisfied. The first specified threshold and the second specified threshold may be the same or different. The first specified threshold and the second specified threshold corresponding to different CXL memory devices may be the same or different, and this embodiment is not limited thereto.
For example, the FM determines the health status of each CXL memory device according to the following criterion: if $E_t$ is greater than or equal to the maximum threshold of the number of memory error units within a period, or $H_t$ is greater than or equal to the maximum threshold of the health indicator, the CXL memory device is considered to be in an abnormal state; otherwise, the CXL memory device is considered to be in a normal working state. (The symbols used for the two thresholds are not reproduced in this text.)
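The decision rule above reduces to a logical OR over two threshold comparisons. A minimal sketch, with hypothetical parameter names for the two thresholds:

```python
def is_abnormal(h_t: float, e_t: float,
                h_threshold: float, e_threshold: float) -> bool:
    """Abnormal if either value reaches (>=) its threshold."""
    return h_t >= h_threshold or e_t >= e_threshold
```

Because either condition alone is sufficient, a device with a burst of row errors is flagged even when its fused health indicator is still below its threshold.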
According to the method and the device, the running state of the CXL memory device is predicted based on the state reference value of the CXL memory device and the corresponding threshold value, and the maximum row error unit number of the CXL memory device and the corresponding threshold value, so that the running state prediction convenience of the CXL memory device can be improved.
In some exemplary embodiments, the server system further includes a control device, similar to the previous embodiments. The CXL memory devices in the CXL memory device group are allowed to be allocated to virtual machine systems running on the CXL hosts in the CXL host group, and a VMM (virtual machine manager) also runs on each CXL host in the CXL host group. Since the access latency of the CXL memory is higher than that of the local memory, the CXL memory can be used entirely for the virtual machine systems, managed and allocated by the VMM of the server.
In this embodiment, in a case where it is predicted that an abnormal memory device having an operation abnormality exists in the CXL memory device group, controlling the execution of the migration operation on the memory data in the abnormal memory device includes: under the condition that abnormal memory devices with abnormal operation in the CXL memory device group are predicted, determining a target host connected with the abnormal memory devices in the CXL host group through a control device; and sending a first notification message to the target virtual machine manager through the control equipment so as to notify the target virtual machine manager to execute migration operation on the memory data in the abnormal memory equipment.
If an abnormal memory device whose operating state is abnormal is predicted, the control device may determine the target host connected to the abnormal memory device in the CXL host group. The CXL host connected to the abnormal memory device may be determined by searching a device node adjacency table, which records the information and connection relationships of the CXL devices. The device node adjacency table is constructed in a manner similar to the CXL device information list described above, which is not repeated here.
The virtual machine manager running on the target host is the target virtual machine manager. After determining the target host, the control device may send a first notification message to the target virtual machine manager, so as to perform a migration operation on the memory data in the abnormal memory device through the target virtual machine manager, where the first notification message carries device information of the abnormal memory device, so that the target virtual machine manager may conveniently determine the CXL memory device to which the memory data is to be migrated.
For example, the FM queries the device node adjacency table, finds the CXL host connected to the CXL memory whose health is abnormal, and sends a message to the server VMM via the management network. The message includes the CXL memory device information, the health indicator, and the connection topology between the CXL memory device and the host. After receiving the message, the server VMM may migrate the memory data in the abnormal CXL memory device to a normal CXL memory device.
According to this embodiment, a notification message is sent to the virtual machine manager of the CXL host connected to the abnormally operating CXL memory device, and the virtual machine manager then controls the migration of the memory data in that CXL memory device, which can improve the convenience of memory data migration.
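The lookup-and-notify step can be sketched as follows. The adjacency table is modelled here as a mapping from a memory device identifier to the set of connected host identifiers, and the message fields mirror the three items listed above; both the data layout and the names are assumptions, and message transport over the management network is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class AbnormalMemoryNotice:
    device_id: str                    # which CXL memory device is abnormal
    health_index: float               # its state reference value
    connected_hosts: Tuple[str, ...]  # connection topology, simplified


def notices_for(device_id: str, health_index: float,
                adjacency: Dict[str, Set[str]]
                ) -> Dict[str, AbnormalMemoryNotice]:
    """Build one notice per host connected to the abnormal device."""
    hosts = tuple(sorted(adjacency.get(device_id, set())))
    notice = AbnormalMemoryNotice(device_id, health_index, hosts)
    return {host: notice for host in hosts}
```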
In some example embodiments, the CXL host connected to the abnormal memory device may be determined by the control device querying a device node adjacency table indicating a connection relationship between the CXL memory devices in the group of CXL memory devices and the CXL hosts in the group of CXL hosts. For an abnormal memory device with abnormal predicted running state, determining, by a control device, a target host connected to the abnormal memory device in the CXL host group, including: and inquiring the device node adjacency list through the control device, determining the CXL host connected with the abnormal memory device in the CXL host group, wherein the determined CXL host is the target host.
The device node adjacency table may be preconfigured or acquired from other devices. In order to improve the accuracy of the device node adjacency table, the control device can construct it from the connection relationships between the CXL hosts and the CXL memory devices, acquired in real time through interaction with the CXL switches of the CXL switch group. Correspondingly, before the control device determines the target host connected to the abnormal memory device in the CXL host group, the method further includes: acquiring, by the control device, a set of specified information of the CXL switch group from the CXL switches of the CXL switch group; and establishing the device node adjacency table based on the acquired set of specified information.
Here, the control device may obtain the set of specified information of the CXL switch group from the CXL switches of the CXL switch group via the management network (e.g., via the designated bus). The set of specified information may include port information and information about the devices connected to the ports, and may also include configuration information. Based on the acquired set of specified information, the control device can build the device node adjacency table. The acquisition may be performed at a time similar to that of the CXL device information list, or after the CXL device information list has been acquired, which is not limited in this embodiment.
For example, a command is sent to the CXL switch through the management network to obtain its configuration, its ports, and the information about the devices connected to the ports; the device node adjacency table is then established.
According to this embodiment, the port information and the information about the devices connected to the ports of the CXL switch are obtained through interaction with the CXL switch, and the device node adjacency table is built based on the obtained information, which can improve the timeliness of building the device node adjacency table.
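Building the device node adjacency table from switch reports might look like the sketch below. The report format is entirely hypothetical: each switch is assumed to return a port-to-device mapping plus a list of (host port, memory port) bindings from its configuration. The actual command set and response layout of a CXL switch are not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_adjacency(switch_reports: List[dict]) -> Dict[str, Set[str]]:
    """Return an undirected adjacency table: device id -> connected ids."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for report in switch_reports:
        ports = report["ports"]  # hypothetical: port number -> device id
        for host_port, mem_port in report["bindings"]:
            host, mem = ports[host_port], ports[mem_port]
            adjacency[host].add(mem)
            adjacency[mem].add(host)
    return adjacency
```

Storing both directions lets the same table answer "which hosts use this memory device" during fault handling and "which memory devices does this host see" during allocation.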
In some exemplary embodiments, to avoid an exception caused by data migration, a memory migration operation may be performed for the virtual machine system, where the memory migration operation may involve modification of configuration information or the like in addition to migrating the memory data, so as to implement allocation of the CXL memory device to which the memory data is migrated to the corresponding virtual machine system.
The virtual machine system to which the abnormal memory device is allocated is the target virtual machine system. After the control device sends the first notification message to the target virtual machine manager, the method further comprises: determining, by the target virtual machine manager, the target virtual machine system to which the abnormal memory device is allocated, and executing a memory migration operation for the target virtual machine system, so as to migrate the memory data in the abnormal memory device into the target memory device and allocate the target memory device to the target virtual machine system.
The target virtual machine manager may determine the virtual machine system to which the abnormal memory device is allocated by querying a target information table. The target information table is stored in the target virtual machine manager and records the CXL memory devices connected to the target host and the virtual machine systems to which those CXL memory devices are allocated.
For example, as shown in fig. 6, a CXL memory management information table (i.e., the target information table) is stored in the server VMM. It records the information of all CXL memory devices connected to the CXL host and the memory allocation information of those devices, including the relative address, size, and allocation status (free or allocated) of the memory, and the UUID (Universally Unique Identifier) of the corresponding virtual machine.
After receiving the message, the server VMM inquires a CXL memory management information table to acquire the UUID and the memory occupation condition of the virtual machine system using the device memory, and searches a proper memory for the virtual machine system to finish memory migration.
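The table query described above can be sketched with one record per memory region. The record fields follow the items listed for the CXL memory management information table (relative address, size, allocation status, virtual machine UUID); the class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryRegion:
    device_id: str
    offset: int                     # relative address within the device
    size: int                       # region size in bytes
    allocated: bool                 # False means free
    vm_uuid: Optional[str] = None   # owner when allocated


def usage_by_vm(table: List[MemoryRegion], device_id: str) -> Dict[str, int]:
    """Bytes of the given device occupied by each virtual machine."""
    usage: Dict[str, int] = {}
    for region in table:
        if (region.device_id == device_id and region.allocated
                and region.vm_uuid is not None):
            usage[region.vm_uuid] = usage.get(region.vm_uuid, 0) + region.size
    return usage
```

The result tells the VMM both which virtual machine systems are affected and how much replacement memory each of them needs.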
According to the embodiment, the stored CXL memory management information table is queried through the virtual machine manager to determine the virtual machine system allocated to the abnormal CXL memory device and complete memory migration for the virtual machine system, so that the convenience and accuracy of memory migration can be improved.
In some exemplary embodiments, before performing, by the target virtual machine manager, the memory migration operation for the target virtual machine system, the method further comprises: inquiring a target information table through a target virtual machine manager to find whether CXL memory equipment meeting memory data migration conditions exists in CXL memory equipment connected with a target host; and under the condition that the CXL memory device meeting the memory data migration condition is found, determining the found CXL memory device as the target memory device.
To determine the CXL memory device to which the memory data is to be migrated, the target virtual machine manager may first query the target information table to find whether, among the CXL memory devices connected to the target host, there is a CXL memory device that satisfies the memory data migration condition, that is, a normally operating CXL memory device whose free memory is large enough to hold the memory data in the abnormal memory device. If such a CXL memory device is found, it is the target memory device.
If the CXL memory device meeting the memory data migration condition is not found, the target virtual machine manager can send a device application request message to the control device to apply for the target host to access the CXL memory device meeting the memory data migration condition, i.e. to allocate a new CXL memory device. In response to the received device application request message, the control device may determine, from among the CXL memory devices that are unoccupied (or have a portion of the memory area unoccupied), a CXL memory device that satisfies the memory data migration condition, and if there is a CXL memory device that satisfies the memory data migration condition, may indicate through the device application response message.
When the target virtual machine manager receives the device application response message returned by the control device in response to the device application request message, it may determine the CXL memory device indicated by that message as the target memory device. If no CXL memory device satisfies the memory data migration condition, the memory migration operation may be terminated and an abnormality alarm may be raised in one or more ways to prompt replacement of the abnormal CXL memory device as soon as possible.
For example, the server VMM looks up suitable memory for the virtual machine system in the CXL memory management information table. If the CXL memory of the local host is insufficient, the VMM applies to the FM for a new CXL memory device; the FM searches the device node adjacency list, finds a suitable memory device, and configures the CXL switch to attach the new CXL memory device to the host. The VMM then begins memory migration for the virtual machine system.
According to this embodiment, a CXL memory device satisfying the memory data migration condition is first searched for among the CXL memory devices of the host, and only if none is found is the control device asked to attach one to the host, which can improve the convenience and success rate of finding a memory device.
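The two-stage search described above can be sketched as follows. This is only an illustration under stated assumptions: the `MemDevice` record, the `pick_target_device` function, and the `request_from_fm` callback are hypothetical names, and the migration condition is simplified to "healthy and enough free bytes".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class MemDevice:
    """Hypothetical row of the target information table."""
    device_id: str
    healthy: bool
    free_bytes: int


def pick_target_device(
    local_devices: Dict[str, MemDevice],
    abnormal_id: str,
    needed_bytes: int,
    request_from_fm: Callable[[int], Optional[MemDevice]],
) -> Optional[MemDevice]:
    """Return a device that can absorb the abnormal device's data, or None."""
    # Stage 1: look among the devices already attached to this host.
    for dev in local_devices.values():
        if (dev.device_id != abnormal_id and dev.healthy
                and dev.free_bytes >= needed_bytes):
            return dev
    # Stage 2: nothing local fits, so ask the control device (FM) for one.
    granted = request_from_fm(needed_bytes)
    if granted is not None:
        local_devices[granted.device_id] = granted  # now attached to the host
    # None means the caller should stop the migration and raise an alarm.
    return granted
```

Passing the FM request as a callback keeps the lookup logic independent of how the request and response messages are actually transported.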
In some example embodiments, memory data in a CXL memory device may be maintained in the form of memory pages, and the virtual machine system records the correspondence (i.e., the translation relationship) between the virtual addresses and physical addresses of the memory pages allocated to it via an EPT (Extended Page Table). Accordingly, memory migration includes not only the migration of the data in the memory pages but also the update of the translation records in the EPT.
Accordingly, performing, by the target virtual machine manager, a memory migration operation for the target virtual machine system, including: and migrating the data in the first memory page of the abnormal memory device to the second memory page of the target memory device through the target virtual machine manager, and updating the first conversion record corresponding to the abnormal memory device in the target expansion page table of the target virtual machine system to the second conversion record so as to complete the memory migration of the target virtual machine system.
The extended page table of the target virtual machine system is a target extended page table. The target virtual machine manager may migrate data in a first memory page of the abnormal memory device to a second memory page of the target memory device, and update a first translation record corresponding to the abnormal memory device in a target extension page table of the target virtual machine system to a second translation record, so as to complete memory migration of the target virtual machine system. Here, the first translation record is used to indicate a correspondence between a memory page address of the first memory page and a physical memory address of the first memory page, and the second translation record is used to indicate a correspondence between a memory page address of the second memory page and a physical memory address of the second memory page.
Alternatively, the update operation of the first translation record may be performed directly after the data migration in the memory page is completed, or may be performed in multiple stages, where the operation of one stage is performed after the data migration in the memory page is completed, and the operation of the other stage may be performed by event triggering, for example, based on the triggering of the access operation to the first memory page, or may be performed in other manners, which are not limited in this embodiment.
According to the embodiment, the memory migration is completed by combining the migration of the data in the memory page with the updating of the record in the extended page table, so that the integrity and the comprehensiveness of the memory migration can be improved.
In some example embodiments, during memory migration, updates to data in the first memory page may not be accurately reflected in the second memory page; for example, an update to data that has already been migrated to the second memory page may not be synchronized into the second memory page. To ensure the accuracy and timeliness of the data in the second memory page, updates made to the data in the memory pages of the abnormal memory device while the data in the first memory page is being migrated to the second memory page can be recorded in a memory page update record.
Correspondingly, after migrating the data in the first memory page of the abnormal memory device to the second memory page of the target memory device by the target virtual machine manager, the method further includes: acquiring a memory page update record through a target virtual machine manager; and updating the data in the memory page of the target memory device according to the memory page update record. The data in the memory page of the target memory device after the update is the latest data.
Here, the memory page update record may only record what kind of update is performed on the data in the first memory page, and the data in the first memory page may not be updated. In the process of updating the data in the memory page of the target memory device according to the memory page update record, if a new update operation exists, the data can be recorded as a new memory page update record until all the memory page update records are used, so that the situation of data update disorder is avoided.
According to the embodiment, the data in the memory page is updated in the memory migration process through the memory page update record, and the migrated data is updated according to the recorded data update mode, so that the accuracy and timeliness of the data in the memory page can be ensured.
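A minimal sketch of replaying a memory page update record onto the migrated pages is given below. The record layout `(page number, byte offset, new bytes)` and the function name `replay_updates` are assumptions made for illustration; the source does not specify how an update is encoded.

```python
from collections import deque
from typing import Deque, Dict, Tuple

# One logged update: (page number, byte offset, new bytes). Hypothetical layout.
Update = Tuple[int, int, bytes]


def replay_updates(target_pages: Dict[int, bytearray], log: Deque[Update]) -> int:
    """Apply logged updates in arrival order until the log is empty."""
    applied = 0
    while log:
        # Oldest first, so a later write to the same bytes wins.
        page_no, offset, data = log.popleft()
        page = target_pages[page_no]
        page[offset:offset + len(data)] = data
        applied += 1
    return applied
```

Because the loop re-checks the queue on every iteration, an update appended while replay is in progress is drained in the same pass, which mirrors the "record new updates until all records are used" behaviour described above.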
In some example embodiments, the updating of the first translation record in the target extended page table may be triggered based on an access operation to data of the first memory page. Before updating the first translation record corresponding to the abnormal memory device in the target extended page table of the target virtual machine system to the second translation record, the method further includes: adding a third translation record to the memory conversion relation table through the target virtual machine manager, and deleting the first translation record from the target extended page table. Here, the memory conversion relation table records the correspondence between the memory page address (virtual memory address) of a memory page of the abnormal CXL memory device whose data has been migrated and the memory page address of the memory page to which that data has been migrated.
After the first translation record is deleted, no translation record corresponds to the memory page address of the first memory page, so an access to that address triggers a target page fault exception. In response to the target page fault exception, the target virtual machine manager may traverse the memory conversion relation table based on the memory page address of the first memory page to obtain the third translation record. Based on the memory page address of the second memory page in the third translation record, the target extended page table may be updated, i.e., the second translation record may be added to the target extended page table; in this way, the first translation record is updated to the second translation record.
For example, firstly, copying data in a memory page of an abnormal CXL memory device into a new memory page, and establishing an abnormal memory conversion relation table, wherein the abnormal memory conversion relation table stores the conversion relation of memory addresses of the abnormal CXL memory device in an EPT page table of a virtual machine system and the corresponding relation of addresses of the new memory page and the old memory page; and simultaneously monitoring and recording all memory page modifications in the migration process. Then, deleting the memory address conversion relation of the abnormal CXL memory device from the EPT page table of the virtual machine system, searching the abnormal memory conversion relation table and the memory modification record in the memory migration process, and updating the memory modification to a new memory page.
When the virtual machine accesses a memory page of the abnormal CXL memory device, the CPU raises an EPT Violation exception because the EPT page table has no corresponding translation. The exception is handled by the VMM, which traverses the abnormal memory conversion relation table and updates, in the EPT page table, the translation relationships of all GPAs (Guest Physical Addresses, i.e., virtual machine physical addresses) corresponding to the abnormal CXL memory device. At this point, the memory migration of the virtual machine system is complete. After the server VMM completes the memory migration of all the virtual machine systems, it sends a message to notify the FM; the message includes the CXL memory device information and an indication that the memory migration of the virtual machine systems succeeded.
According to this embodiment, the memory conversion relation table records the correspondence between the memory page address of each memory page of the abnormal CXL memory device whose data has been migrated and the memory page address of the memory page to which that data has been migrated; the memory address translation relationships of the abnormal CXL memory device are deleted from the EPT page table; and the extended page table is updated only when a memory page access request triggers a page fault exception. This ensures the accuracy of data access while avoiding the system burden caused by concentrating the data operations.
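The fault-driven update can be modelled with plain dictionaries, as in the toy sketch below. It is not VMM code: `LazyEptRemap` and its methods are hypothetical names, page numbers stand in for addresses, and a dictionary lookup miss stands in for the EPT Violation exit.

```python
from typing import Dict


class LazyEptRemap:
    """Toy model: dicts stand in for the EPT and the conversion relation table."""

    def __init__(self, ept: Dict[int, int]):
        self.ept = ept                    # guest physical page -> host physical page
        self.remap: Dict[int, int] = {}   # guest physical page -> new host page

    def retire_page(self, gpa: int, new_hpa: int) -> None:
        # Record where the data now lives, then drop the stale EPT entry.
        self.remap[gpa] = new_hpa
        self.ept.pop(gpa, None)

    def translate(self, gpa: int) -> int:
        if gpa not in self.ept:
            self.handle_fault(gpa)        # stands in for the EPT Violation exit
        return self.ept[gpa]

    def handle_fault(self, gpa: int) -> None:
        if gpa not in self.remap:
            raise KeyError(f"no mapping for guest page {gpa:#x}")
        # Following the example in the text, the first fault refreshes every
        # migrated page, not only the one that was accessed.
        self.ept.update(self.remap)
        self.remap.clear()
```

An alternative consistent with the per-record description would install only the faulting page's entry in `handle_fault`; which variant is preferable depends on how many faults the system is willing to take.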
In some exemplary embodiments, after controlling the performing of the migration operation on the memory data in the abnormal memory device, the method further includes: displaying abnormal alarm information on a designated display interface, wherein the abnormal alarm information is used for indicating the equipment position of abnormal memory equipment and the abnormal reason of the abnormal memory equipment; and sending a second notification message to the abnormal memory device, wherein the second notification message is used for notifying the abnormal memory device to alarm.
After the control performs the migration operation on the memory data in the abnormal memory device, in order to ensure that the abnormal CXL memory device can be found and replaced in time, the control device may display the abnormal alarm information on a designated display interface, where the designated display interface may be a display interface corresponding to the control device. The abnormality alert information may be sent in one or more ways, for example, by text, a warning light, a warning tone, or other means.
In addition, the control device may send a second notification message to the abnormal memory device to notify it to raise an alarm. After receiving the second notification message, the abnormal memory device may raise the alarm through an indicator light, a prompt tone, or other means to indicate its own fault.
For example, after receiving, from all CXL hosts connected to the abnormal CXL memory device, messages indicating that the memory migration of their virtual machine systems succeeded, the FM can raise an alarm to prompt operation and maintenance personnel to perform maintenance, which improves the reliability of the system and avoids downtime and loss caused by memory faults. The alarm may be raised as follows: the FM sends an instruction through the management network to notify the CXL memory device to raise an alarm, and at the same time the FM system state diagram interface shows a conspicuous alarm mark, where the alarm information includes the physical position of the CXL memory device and the reason for the abnormality. The CXL memory device in the abnormal state raises the alarm through its indicator light to mark its own fault.
According to the embodiment, the abnormal alarm information is displayed on the appointed display interface and the abnormal CXL memory device is notified to alarm, so that operation and maintenance personnel can be prompted to maintain, the reliability of the system is improved, and downtime and loss caused by memory faults are avoided.
In some exemplary embodiments, after predicting the operating states of the CXL memory devices in the group of CXL memory devices based on the obtained parameter values for the set of operating parameters, the method further comprises: and under the condition that the CXL memory devices in the CXL memory device group are predicted to normally operate, saving parameter values of a group of operation parameters.
In this embodiment, for any CXL memory device, if it is predicted that the CXL memory device is operating normally, the control device may save the parameter values of the set of operating parameters of the CXL memory device obtained at this time, and the saved parameter values of the set of operating parameters of the CXL memory device may be used in a subsequent process of predicting the operating state of the CXL memory device. In addition, to save memory space, some of the saved parameter values may be deleted, e.g., parameter values that are not needed in the subsequent prediction process.
For example, when the CXL memory device is in a normal operating state, the step of periodically monitoring the CXL memory state (i.e., the aforementioned state detecting step) is repeatedly performed; and when the CXL memory device is abnormal, executing a memory migration step.
According to the embodiment, when the CXL memory device is predicted to run normally, the parameter values of the set of running parameters of the CXL memory device obtained at the time are stored, so that the convenience and accuracy of CXL memory device running state prediction can be improved.
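One way to keep only the parameter values that later predictions need is a fixed-length window per device, sketched below. The class name `ParameterHistory` is hypothetical, and the window length of three periods is an assumption taken from the embodiment that predicts from the t-th, (t-1)-th, and (t-2)-th periods.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Sample = Tuple[float, float, float, float]  # (v, f, te, vo) for one period


class ParameterHistory:
    """Keeps the most recent samples per device and discards older ones."""

    def __init__(self, window: int = 3):
        self.window = window
        self._by_device: Dict[str, Deque[Sample]] = {}

    def save(self, device_id: str, sample: Sample) -> None:
        # deque(maxlen=...) drops the oldest sample automatically when full.
        history = self._by_device.setdefault(device_id, deque(maxlen=self.window))
        history.append(sample)

    def ready(self, device_id: str) -> bool:
        # True once a full window is available for prediction.
        return len(self._by_device.get(device_id, ())) == self.window

    def samples(self, device_id: str) -> List[Sample]:
        return list(self._by_device.get(device_id, ()))
```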
The CXL memory fault tolerance method in this embodiment is explained below with reference to an alternative example. In this alternative example, the control device is an FM, and the target information table is a CXL memory management information table.
Aiming at the problem of serious loss caused by CXL memory equipment faults, the alternative example provides a CXL memory management and fault tolerance design method, which periodically monitors the state of CXL memory equipment and predicts the health state of the CXL memory equipment, when the health state of the CXL memory is predicted to be abnormal, a virtual machine system related to the CXL memory equipment is automatically migrated to the CXL memory equipment in the health state, and then an alarm prompts operation and maintenance personnel to maintain.
As shown in fig. 7, the flow of memory migration in this alternative example may include the following steps S702 to S716.
In step S702, the FM detects an abnormality in a CXL memory device. Here, the FM periodically performs the aforementioned state detection step to monitor the health state of the CXL memory devices in real time. In addition, the FM may notify the VMM of each CXL host to which the abnormal CXL memory device is connected, so that the VMM performs memory migration for the virtual machine systems to which the abnormal CXL memory device is assigned.
In step S704, the server VMM allocates new memory for the virtual machine system.
In step S706, the memory migration starts.
In step S708, the VMM copies the memory data to the new memory, establishes an abnormal memory translation table, and records the memory modification during migration.
In step S710, the VMM deletes the abnormal memory address translation relationship in the EPT and updates the memory modification to the new memory.
In step S712, the virtual machine accesses the abnormal memory, triggering an EPT Violation (page fault) exception.
In step S714, the VMM handles the exception and updates the EPT page table.
In step S716, the memory migration is completed.
Through this optional example, the FM collects the states of all CXL memory devices from the CXL hosts and the CXL memory devices through the management network bus and monitors and predicts the health states of the CXL memory devices in real time. When a CXL memory device is predicted to be in an abnormal state, the FM notifies the VMM of the server to migrate the virtual machine systems using that CXL memory device to a memory device in a healthy state, and raises an alarm to prompt inspection and maintenance, so that the reliability of the system can be improved and downtime and loss caused by memory faults can be avoided.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
In this embodiment, a server system is further provided, and the server system is used to implement the foregoing embodiments and preferred embodiments, and will not be described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 8 is a block diagram of still another alternative server system according to an embodiment of the present application, as shown in fig. 8, including: the CXL memory device group 802, the CXL switch group 804, the CXL host group 806 and the control device 808, wherein CXL memory device group 802 and CXL host group 806 link to each other with CXL switch group 804 through the CXL bus, and CXL switch group 804 is used for connecting CXL memory device group 802 and CXL host group 806.
A control device 808, configured to obtain parameter values of a set of operation parameters of CXL memory devices in the CXL memory device group 802, where all CXL memory devices in the CXL memory device group 802 are in an operation state, and the set of operation parameters is used to represent an operation state of the corresponding CXL memory device; predicting the operation state of the CXL memory devices in the CXL memory device group 802 according to the acquired parameter value of the set of operation parameters; in the case that an abnormal memory device running abnormally is predicted to exist in the CXL memory device group 802, a migration operation is controlled to be performed on the memory data in the abnormal memory device, so as to migrate the memory data in the abnormal memory device to a target memory device running normally in the CXL memory device group 802, where the abnormal memory device after the data migration is removed from the CXL memory device group 802.
Alternatively, steps S302 to S306 in the foregoing embodiment may be performed by the control device 808 in the present embodiment.
Through the server system, parameter values of a set of operating parameters of the CXL memory devices in the CXL memory device group are acquired, where the CXL memory devices in the CXL memory device group are in an operating state and the set of operating parameters represents the operating state of the corresponding CXL memory device; the operating states of the CXL memory devices in the CXL memory device group are predicted according to the acquired parameter values; and when an abnormally operating memory device is predicted to exist in the CXL memory device group, a migration operation is controlled to be performed on the memory data in the abnormal memory device so as to migrate that data to a target memory device whose operating state is normal in the CXL memory device group, where the abnormal memory device is removed from the CXL memory device group after the data migration. This solves the problem in the related art that the CXL memory fault tolerance method yields a low server memory utilization rate because hot-standby memory occupies server slots, and thus improves the memory utilization rate of the server.
In some exemplary embodiments, control device 808 is coupled to CXL memory device group 802, CXL switch group 804, and CXL host group 806, respectively, via a designated bus that is a different bus than the CXL bus. The control device 808 is further configured to periodically perform a parameter value collection operation of a set of operating parameters on the CXL memory devices in the CXL memory device group 802 via the specified bus, to obtain a collected parameter value of the set of operating parameters.
In some exemplary embodiments, control device 808 is further configured to predict an operating state of a CXL memory device within CXL memory device group 802 based on parameter values for a set of operating parameters collected over a continuous plurality of cycles.
In some exemplary embodiments, control device 808 is further configured to predict an operating state of a CXL memory device in CXL memory device group 802 based on the parameter values of the set of operating parameters collected during the t-th period, the parameter values of the set of operating parameters collected during the (t-1)-th period, and the parameter values of the set of operating parameters collected during the (t-2)-th period.
In some exemplary embodiments, the set of operating parameters includes an erase speed v, a run-time frequency f, a run-time average temperature te, and a run-time average voltage vo; the parameter values of the set of operating parameters collected during the n-th period are v_n, f_n, te_n, and vo_n, respectively, where n is a positive integer greater than or equal to 1.
The control device 808 is further configured to: construct a target data matrix D_t, where D_t = (S_{t-2}, S_{t-1}, S_t), S_n is the feature vector of the n-th period corresponding to the set of operating parameters, V is the configured erase speed, F is the configured maximum operating frequency, TE is the configured average operating temperature, and VO is the configured average operating voltage; perform convolution processing on the target data matrix D_t using a preset convolution kernel I to obtain a target convolution value G_t, where the preset convolution kernel I is configured based on the influence factors corresponding to the operating parameters in the set of operating parameters, and each influence factor represents the degree of association between the corresponding operating parameter and memory faults of the corresponding memory device; and calculate the state reference value H_t of the CXL memory devices in the CXL memory device group 802 by the following equation:
The operating states of the CXL memory devices in the CXL memory device group 802 are predicted based on the state reference value H_t of the CXL memory devices in the CXL memory device group 802 and the maximum row error unit number E_t of the t-th cycle of the CXL memory devices in the CXL memory device group 802. Here, μ, δ, ε, and ρ are the set weighting coefficients, E_t is the maximum row error unit number of the corresponding CXL memory device in the t-th cycle, S is the configured standard deviation of the row error unit number of the corresponding CXL memory device, and the maximum row error unit number is the number of error units in the row with the most errors of the corresponding CXL memory device.
In some exemplary embodiments, the server system further includes a control device 808, the control device 808 being coupled to the CXL memory device group 802, the CXL switch group 804, and the CXL host group 806, respectively, via a designated bus that is a different bus than the CXL bus.
The control device 808 is further configured to obtain, from the CXL memory devices in the CXL memory device group 802 via the designated bus, configuration information of the CXL memory devices in the CXL memory device group 802 before predicting an operating state of the CXL memory devices in the CXL memory device group 802 based on parameter values of a set of operating parameters collected over a plurality of consecutive periods, wherein the configuration information of the CXL memory devices in the CXL memory device group 802 is included in the SPD data of the CXL memory devices in the CXL memory device group 802 and indicates the following configuration items of those devices: the configured erase speed V, the configured maximum frequency F, the configured average temperature TE, the configured average voltage VO, the average value of the number of error units, and the configured standard deviation S of the number of row error units.
In some exemplary embodiments, the control device 808 is further configured to obtain device information of the CXL devices in the server system via the designated bus before obtaining the configuration information of the CXL memory devices in the CXL memory device group 802 from the CXL memory devices in the CXL memory device group 802 via the designated bus, wherein the device types of the CXL devices in the server system include the CXL memory device, the CXL switch, and the CXL host; and to construct a CXL device information list based on the acquired device information of the CXL devices in the server system, wherein the CXL device information list includes the device information of the CXL memory devices in the CXL memory device group 802, the device information of the CXL switches in the CXL switch group 804, and the device information of the CXL hosts in the CXL host group 806, and the configuration information is acquired by traversing the CXL device information list.
In some exemplary embodiments, S_t, S_{t-1}, and S_{t-2} are each 4-row, 1-column vectors, and the target data matrix D_t is a 4-row, 3-column data matrix.
The control device 808 is further configured to construct a preset convolution kernel I, wherein the preset convolution kernel I is defined in terms of four coefficients that are the influence factors of the erase speed v, the run-time frequency f, the run-time average temperature te, and the run-time average voltage vo on memory faults, respectively.
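The shapes involved can be illustrated with a small sketch. Everything specific here is an assumption rather than the formula of this application: the feature vector is taken to be each measured value divided by its configured value, the kernel is taken to have the same 4-row, 3-column shape as D_t, and the kernel is applied without flipping. The function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def feature_vector(sample: Vector, config: Vector) -> List[float]:
    # ASSUMPTION: each measured value (v, f, te, vo) is scaled by the
    # matching configured value (V, F, TE, VO).
    return [measured / configured for measured, configured in zip(sample, config)]


def data_matrix(s_t2: Vector, s_t1: Vector, s_t: Vector) -> List[List[float]]:
    # Columns are S_{t-2}, S_{t-1}, S_t; rows follow the order v, f, te, vo.
    return [list(row) for row in zip(s_t2, s_t1, s_t)]


def convolve_full_overlap(d: List[List[float]], kernel: List[List[float]]) -> float:
    # ASSUMPTION: with a kernel of the same shape as D_t there is a single
    # overlap position, so the result is one number (unflipped kernel).
    return sum(x * k
               for d_row, k_row in zip(d, kernel)
               for x, k in zip(d_row, k_row))
```

With three samples at 50%, 100%, and 200% of the configured values and an all-ones kernel, each row of the matrix is `[0.5, 1.0, 2.0]` and the convolution value is their total over four rows.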
In some exemplary embodiments, the control device 808 is further configured to determine that the corresponding CXL memory device is operating abnormally if the state reference value H_t of a CXL memory device in the CXL memory device group 802 is greater than or equal to the first specified threshold; and to determine that the corresponding CXL memory device is operating abnormally if the maximum row error unit number E_t of the t-th cycle of a CXL memory device in the CXL memory device group 802 is greater than or equal to the second specified threshold.
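The decision rule above amounts to a logical OR of two threshold checks. A minimal sketch, with the hypothetical name `is_abnormal` and caller-supplied thresholds (the source does not give their values):

```python
def is_abnormal(h_t: float, e_t: int, h_threshold: float, e_threshold: int) -> bool:
    """Flag a device when either indicator reaches its threshold."""
    # Either condition alone is sufficient to mark the device abnormal.
    return h_t >= h_threshold or e_t >= e_threshold
```

Keeping the two checks independent means a device with a sudden burst of row errors is flagged even when its state reference value still looks normal.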
In some exemplary embodiments, the server system further includes a control device 808; the CXL memory devices in the CXL memory device group 802 are allowed to be assigned to virtual machine systems running on the CXL hosts in the CXL host group 806, and the CXL hosts in the CXL host group 806 also run a virtual machine manager.
The control device 808 is further configured to determine, in the CXL host group 806, a target host connected to the abnormal memory device if it is predicted that the abnormal memory device exists in the CXL memory device group 802; and sending a first notification message to the target virtual machine manager to notify the target virtual machine manager to execute migration operation on the memory data in the abnormal memory device, wherein the target virtual machine manager is a virtual machine manager running on the target host, and the first notification message carries the device information of the abnormal memory device.
In some exemplary embodiments, the control device 808 is further configured to, prior to determining, in the CXL host group 806, a target host connected to the abnormal memory device, obtain, from the CXL switches of the CXL switch group 804, a set of designation information for the CXL switches of the CXL switch group 804, wherein the set of designation information includes port information and device information for the port connection; based on the acquired set of specified information of the CXL switches of the CXL switch group 804, a device node adjacency table is established, wherein the device node adjacency table is configured to indicate a connection relationship between the CXL memory devices in the CXL memory device group 802 and the CXL hosts in the CXL host group 806; if it is predicted that an abnormal memory device having an abnormal operation exists in the CXL memory device group 802, the device node adjacency list is queried, and it is determined that the CXL host connected to the abnormal memory device in the CXL host group 806 is the target host.
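A device node adjacency table of this kind might be built as in the sketch below. The record layout `(switch id, port number, device id, device kind)` and the function names are hypothetical, and the sketch makes a simplifying assumption: a single switch level in which every host on a switch can reach every memory device on the same switch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (switch id, port number, device id, device kind). Hypothetical record layout.
PortRecord = Tuple[str, int, str, str]


def build_adjacency(records: Iterable[PortRecord]) -> Dict[str, Set[str]]:
    """Map each memory device to the hosts that share a switch with it."""
    hosts_on: Dict[str, Set[str]] = defaultdict(set)
    mems_on: Dict[str, Set[str]] = defaultdict(set)
    for switch_id, _port, device_id, kind in records:
        if kind == "host":
            hosts_on[switch_id].add(device_id)
        elif kind == "memory":
            mems_on[switch_id].add(device_id)
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for switch_id, mems in mems_on.items():
        for mem in mems:
            adjacency[mem] |= hosts_on[switch_id]
    return dict(adjacency)


def hosts_for(adjacency: Dict[str, Set[str]], abnormal_id: str) -> List[str]:
    # Hosts that must be notified when this memory device turns abnormal.
    return sorted(adjacency.get(abnormal_id, ()))
```

A real fabric would need per-port binding information to narrow this down to the hosts actually using a device; the sketch only captures the lookup direction (memory device to hosts).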
In some exemplary embodiments, the target virtual machine manager stores a target information table, where the target information table is used to record the CXL memory device associated with the target host and the virtual machine system to which the CXL memory device associated with the target host is assigned. The target virtual machine manager is further used for inquiring the target information table in response to the received first notification message so as to determine a target virtual machine system allocated by the abnormal memory device; and executing memory migration operation for the target virtual machine system to migrate the memory data in the abnormal memory device to the target memory device, and distributing the target memory device to the target virtual machine system.
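As a rough illustration, the target information table can be thought of as a mapping from memory device to the virtual machine systems it is assigned to. The dictionary layout and the names `vms_using` and `reassign` are assumptions, not the table format of this application.

```python
from typing import Dict, List


def vms_using(info_table: Dict[str, List[str]], abnormal_id: str) -> List[str]:
    """Virtual machine systems that currently hold memory on the device."""
    return list(info_table.get(abnormal_id, []))


def reassign(info_table: Dict[str, List[str]], abnormal_id: str,
             target_id: str) -> List[str]:
    """After migration, record the target device as assigned to those systems."""
    moved = info_table.pop(abnormal_id, [])
    info_table.setdefault(target_id, []).extend(moved)
    return moved
```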
In some exemplary embodiments, the target virtual machine manager is further configured to, before performing the memory migration operation for the target virtual machine system, query the target information table to find whether there is a CXL memory device satisfying the memory data migration condition among CXL memory devices connected to the target host; under the condition that CXL memory devices meeting the memory data migration conditions are found, the found CXL memory devices are determined to be target memory devices; if the CXL memory device satisfying the memory data migration condition is not found, sending a device application request message to the control device 808, where the device application request message is used to apply for accessing the CXL memory device satisfying the memory data migration condition to the target host; in the case of receiving a device application response message returned by the control device 808 in response to the device application request, determining the CXL memory device indicated by the device application response message as the target memory device; the memory data migration condition is that the idle memory in the idle state allows memory data in the abnormal memory device to be written in.
In some exemplary embodiments, the target virtual machine manager is further configured to migrate data in a first memory page of the abnormal memory device to a second memory page of the target memory device, and update a first translation record corresponding to the abnormal memory device in a target extended page table of the target virtual machine system to a second translation record to complete memory migration of the target virtual machine system, where the first translation record is used to indicate a correspondence between a memory page address of the first memory page and a physical memory address of the first memory page, and the second translation record is used to indicate a correspondence between a memory page address of the second memory page and a physical memory address of the second memory page.
In some exemplary embodiments, the target virtual machine manager is further configured to obtain a memory page update record after migrating the data in the first memory page of the abnormal memory device to the second memory page of the target memory device, where the memory page update record is used to record an update of the data in the memory page of the abnormal memory device during migrating the data in the first memory page to the second memory page; and updating the data in the memory page of the target memory device according to the memory page update record.
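The copy, replay, and remap sequence described in the two preceding paragraphs can be sketched as follows. This is a toy model under stated assumptions, not the applicant's implementation: pages are held in Python dictionaries, writes arriving during the copy are passed in as a list to simulate guest activity, and the device labels `"abnormal"` and `"target"` as well as the function name `migrate_device_pages` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pages = Dict[int, bytes]            # page number -> page contents
Ept = Dict[int, Tuple[str, int]]    # guest page address -> (device label, page number)


def migrate_device_pages(
    src_pages: Pages,
    dst_pages: Pages,
    page_map: Dict[int, int],
    ept: Ept,
    writes_during_copy: Iterable[Tuple[int, bytes]] = (),
) -> Set[int]:
    """Copy pages, replay pages updated during the copy, then remap the EPT."""
    update_record: Set[int] = set()  # plays the role of the memory page update record

    # 1. Bulk copy of every first (source) page to its second (destination) page.
    for src_no, dst_no in page_map.items():
        dst_pages[dst_no] = src_pages[src_no]

    # 2. Simulated guest writes that hit the source while the copy was running.
    for src_no, data in writes_during_copy:
        src_pages[src_no] = data
        update_record.add(src_no)

    # 3. Replay: bring the destination up to date for every recorded page.
    for src_no in update_record:
        dst_pages[page_map[src_no]] = src_pages[src_no]

    # 4. Replace each first translation record with a second translation record.
    for guest_addr, (device, page_no) in list(ept.items()):
        if device == "abnormal" and page_no in page_map:
            ept[guest_addr] = ("target", page_map[page_no])
    return update_record


src = {0: b"A", 1: b"B"}
dst: Pages = {}
ept: Ept = {0x1000: ("abnormal", 0), 0x2000: ("abnormal", 1)}
dirty = migrate_device_pages(src, dst, {0: 10, 1: 11}, ept, [(1, b"B2")])
```

A real hypervisor would obtain the update record from hardware or software dirty tracking rather than from an explicit list of writes.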
In some exemplary embodiments, the target virtual machine manager is further configured to, before updating the first translation record corresponding to the abnormal memory device in the target extended page table of the target virtual machine system to the second translation record, add a third translation record to a memory translation relationship table and delete the first translation record from the target extended page table, where the memory translation relationship table is used to record the correspondence between the memory page address of a memory page from which data of a CXL memory device in an abnormal state has been migrated and the memory page address of the memory page to which that data has been migrated; and, in response to an acquired target page fault exception, traverse the memory translation relationship table based on the memory page address of the first memory page to obtain the third translation record, where the target extended page table is updated based on the memory page address of the second memory page in the third translation record.
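The deferred update of the extended page table through a memory translation relationship table can be sketched as below. This is a simplified illustration, not the applicant's implementation: addresses are plain integers, the table is keyed by the address that faults, the second address in a record is installed directly as the new translation, and the class and method names (`LazyRemapper`, `begin_remap`, `translate`) are hypothetical.

```python
from typing import Dict, List, Tuple


class PageFaultError(Exception):
    """Raised when a faulting address has no record in the translation table."""


class LazyRemapper:
    def __init__(self, ept: Dict[int, int]) -> None:
        self.ept = dict(ept)  # page address -> physical address
        # Memory translation relationship table: (first page addr, second page addr).
        self.translation_table: List[Tuple[int, int]] = []

    def begin_remap(self, first_addr: int, second_addr: int) -> None:
        # Add the third translation record, then delete the first record from
        # the page table so that the next access raises a page fault.
        self.translation_table.append((first_addr, second_addr))
        self.ept.pop(first_addr, None)

    def handle_page_fault(self, addr: int) -> None:
        # Traverse the table using the address of the first memory page.
        for first_addr, second_addr in self.translation_table:
            if first_addr == addr:
                self.ept[addr] = second_addr  # update the page table
                return
        raise PageFaultError(hex(addr))

    def translate(self, addr: int) -> int:
        if addr not in self.ept:  # models the page fault exception
            self.handle_page_fault(addr)
        return self.ept[addr]


remapper = LazyRemapper({0x1000: 0xA000})
remapper.begin_remap(0x1000, 0xB000)
print(hex(remapper.translate(0x1000)))  # 0xb000
```

Deferring the page table update to the first access means that only pages the virtual machine actually touches pay the remapping cost.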
In some exemplary embodiments, the control device 808 is further configured to display, on the designated display interface, abnormality alert information after controlling the migration operation on the memory data in the abnormal memory device, where the abnormality alert information is used to indicate a device location of the abnormal memory device and an abnormality cause of the abnormal memory device; and sending a second notification message to the abnormal memory device, wherein the second notification message is used for notifying the abnormal memory device to alarm.
In some exemplary embodiments, the control device 808 is further configured to, after predicting an operation state of the CXL memory devices in the CXL memory device group 802 according to the obtained parameter values of the set of operation parameters, save the parameter values of the set of operation parameters if it is predicted that the CXL memory devices in the CXL memory device group 802 are operating normally.
It should be noted that each of the above modules may be implemented by software or hardware; in the latter case, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor, or the modules are located in different processors in any combination.
According to one aspect of the present application, there is provided a computer program product comprising a computer program/instructions containing program code for executing the method shown in the flowchart. Fig. 9 schematically shows a block diagram of a computer system for implementing an electronic device of an embodiment of the application. As shown in Fig. 9, the computer system 900 includes a CPU 901, which can execute various appropriate actions and processes according to a program stored in a ROM 902 (Read-Only Memory) or a program loaded from a storage section 908 into a RAM 903 (Random Access Memory). The RAM 903 also stores various programs and data required for system operation. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An I/O interface 905 (Input/Output interface) is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display), a speaker, and the like; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a local area network card, a modem, or the like. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it is installed into the storage section 908 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication section 909 and/or installed from the removable medium 911. When the computer program is executed by the CPU 901, it performs the various functions defined in the system of the present application.
It should be noted that the computer system 900 of the electronic device shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of application of the embodiments of the present application.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a U disk, ROM, RAM, a removable hard disk, a magnetic disk, or an optical disk.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented on a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices, and they may be implemented in program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein; alternatively, the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.

Claims (22)

1. A CXL memory fault tolerance method is characterized in that,
The method is applied to a server system, wherein the server system comprises a CXL memory device group, a CXL exchange unit and a CXL host unit, the CXL memory device group and the CXL host unit are connected with the CXL exchange unit through a CXL bus, and the CXL exchange unit is used for connecting the CXL memory device group and the CXL host unit;
The method comprises the following steps:
Acquiring parameter values of a set of operation parameters of CXL memory devices in the CXL memory device group, wherein the CXL memory devices in the CXL memory device group are all in an operation state, and the set of operation parameters are used for representing the operation state of the corresponding CXL memory devices;
Predicting the operation state of CXL memory devices in the CXL memory device group according to the acquired parameter values of the group of operation parameters;
and under the condition that the abnormal memory device with abnormal operation in the CXL memory device group is predicted, controlling to execute migration operation on the memory data in the abnormal memory device so as to migrate the memory data in the abnormal memory device to a target memory device with normal operation state in the CXL memory device group, wherein the abnormal memory device after data migration is removed from the CXL memory device group.
2. The method of claim 1, characterized in that,
The server system further comprises a control device which is respectively connected with the CXL memory device group, the CXL exchange group and the CXL host group through a designated bus, wherein the designated bus and the CXL bus are different buses;
The obtaining parameter values of a set of operating parameters of CXL memory devices in the CXL memory device group includes:
And periodically executing the parameter value acquisition operation of the set of operation parameters on CXL memory devices in the CXL memory device group through the appointed bus by the control device to obtain the acquired parameter values of the set of operation parameters.
3. The method of claim 1, characterized in that,
The predicting the operation state of the CXL memory device in the group of CXL memory devices according to the acquired parameter value for the set of operation parameters includes:
And predicting the operation state of the CXL memory devices in the CXL memory device group according to the parameter values of the group of operation parameters acquired in a plurality of continuous periods.
4. The method of claim 3, characterized in that,
The predicting the operation state of the CXL memory device in the group of CXL memory devices based on the parameter values for the set of operation parameters collected over a plurality of consecutive periods includes:
And predicting the operation state of CXL memory devices in the CXL memory device group according to the parameter values of the group of operation parameters acquired in the t-th period, the parameter values of the group of operation parameters acquired in the t-1 th period and the parameter values of the group of operation parameters acquired in the t-2 th period.
5. The method of claim 4, characterized in that,
The set of operation parameters comprises an erasing speed v, an operation-time frequency f, an operation-time average temperature te, and an operation-time average voltage vo; the parameter values of the set of operation parameters acquired in the n-th period are $v_n$, $f_n$, $te_n$, and $vo_n$ respectively, where n is a positive integer greater than or equal to 1;
The predicting the operation state of the CXL memory devices in the CXL memory device group based on the parameter values of the set of operation parameters collected in the t-th period, the parameter values of the set of operation parameters collected in the (t-1)-th period, and the parameter values of the set of operation parameters collected in the (t-2)-th period includes:
constructing a target data matrix $D_t$, wherein $D_t = (S_{t-2}, S_{t-1}, S_t)$, $S_n$ is the feature vector of the n-th period corresponding to the set of operation parameters, V is the configured erasing speed, F is the configured operation maximum frequency, TE is the configured operation average temperature, and VO is the configured operation average voltage;
Performing convolution processing on the target data matrix $D_t$ by using a preset convolution kernel I to obtain a target convolution value $G_t$, wherein the preset convolution kernel I is a convolution kernel configured based on the influence factors corresponding to the operation parameters in the set of operation parameters, and the influence factor corresponding to an operation parameter is used for representing the degree of association between that operation parameter and a memory fault of the corresponding memory device;
the state reference value $H_t$ of the CXL memory devices in the CXL memory device group is calculated by the following formula:
Where μ, δ, ε, and ρ are the set weighting coefficients; $E_t$ is the maximum number of row error units of the corresponding CXL memory device in the t-th period; the formula further uses the configured mean of the number of row error units of the corresponding CXL memory device; S is the configured standard deviation of the number of row error units of the corresponding CXL memory device; and the maximum number of row error units is the number of error units in the row with the most errors of the corresponding CXL memory device;
And predicting the operation state of the CXL memory devices in the CXL memory device group based on the state reference value $H_t$ of the CXL memory devices in the CXL memory device group and the maximum number of row error units $E_t$ of the CXL memory devices in the t-th period.
6. The method of claim 5, characterized in that,
The server system further comprises a control device which is respectively connected with the CXL memory device group, the CXL exchange group and the CXL host group through a designated bus, wherein the designated bus and the CXL bus are different buses;
Before predicting the operating state of the CXL memory device within the group of CXL memory devices based upon the parameter values for the set of operating parameters collected over the consecutive plurality of cycles, the method further comprises:
Obtaining, by the control device, configuration information of the CXL memory devices in the CXL memory device group from those CXL memory devices via the designated bus, wherein the configuration information is included in the SPD data of the CXL memory devices in the CXL memory device group and indicates at least one of the following for the CXL memory devices in the CXL memory device group: the configured erasing speed V, the configured operation maximum frequency F, the configured operation average temperature TE, the configured operation average voltage VO, the configured mean of the number of row error units, and the configured standard deviation S of the number of row error units.
7. The method of claim 6, characterized in that,
Before the obtaining, by the control device, configuration information of CXL memory devices in the group of CXL memory devices from the CXL memory devices via the designated bus, the method further comprises:
Obtaining device information from CXL devices in the server system through the appointed bus by the control device to obtain the device information of the CXL devices in the server system, wherein the device types of the CXL devices in the server system comprise CXL memory devices, CXL switches and CXL hosts;
And constructing a CXL device information list based on the acquired device information of the CXL devices in the server system, wherein the CXL device information list comprises the device information of the CXL memory devices in the CXL memory device group, the device information of the CXL switches in the CXL switch group and the device information of the CXL host in the CXL host group, and acquiring the configuration information of the abnormal memory devices is performed by traversing the CXL device information list.
8. The method of claim 5, characterized in that,
$S_t$, $S_{t-1}$ and $S_{t-2}$ are column vectors of 4 rows and 1 column, and the target data matrix $D_t$ is a data matrix of 4 rows and 3 columns;
Before the target data matrix $D_t$ is subjected to convolution processing by using the preset convolution kernel I to obtain a target convolution value $G_t$, the method further includes:
constructing the preset convolution kernel I from four influence factors, which respectively represent the influence of the erasing speed v, the operation-time frequency f, the operation-time average temperature te, and the operation-time average voltage vo on memory faults.
9. The method of claim 5, characterized in that,
The predicting the current operation state of the CXL memory devices in the CXL memory device group based on the state reference value $H_t$ of the CXL memory devices in the CXL memory device group and the maximum number of row error units $E_t$ of the CXL memory devices in the CXL memory device group in the t-th period includes:
Determining that the corresponding CXL memory device operates abnormally in the case where the state reference value $H_t$ of a CXL memory device in the CXL memory device group is greater than or equal to a first specified threshold;
And determining that the corresponding CXL memory device operates abnormally in the case where the maximum number of row error units $E_t$ of a CXL memory device in the CXL memory device group in the t-th period is greater than or equal to a second specified threshold.
10. The method of claim 1, characterized in that,
The server system further comprises a control device, the CXL memory devices in the CXL memory device group are allowed to be allocated to virtual machine systems running on the CXL hosts in the CXL host group, and a virtual machine manager also runs on the CXL hosts in the CXL host group;
The controlling, in the case where it is predicted that an abnormal memory device having an abnormal operation exists in the CXL memory device group, performing a migration operation on memory data in the abnormal memory device includes:
Under the condition that abnormal memory devices with abnormal operation in the CXL memory device group are predicted, determining a target host connected with the abnormal memory devices in the CXL host group through the control device;
And sending a first notification message to a target virtual machine manager through the control device so as to notify the target virtual machine manager to execute migration operation on the memory data in the abnormal memory device, wherein the target virtual machine manager is a virtual machine manager running on the target host, and the first notification message carries the device information of the abnormal memory device.
11. The method of claim 10, characterized in that,
Before the determining, by the control device, the target host connected to the abnormal memory device in the CXL host group, the method further includes: acquiring, by the control device, a set of specified information of the CXL switches from the CXL switches in the CXL switch unit, wherein the set of specified information comprises port information and device information of the devices connected to the ports; and establishing a device node adjacency table based on the acquired set of specified information of the CXL switches in the CXL switch unit, wherein the device node adjacency table is used for indicating the connection relationship between the CXL memory devices in the CXL memory device group and the CXL hosts in the CXL host group;
The determining, by the control device, the target host connected with the abnormal memory device in the CXL host group in the case where it is predicted that an abnormal memory device with abnormal operation exists in the CXL memory device group includes: in the case where it is predicted that an abnormal memory device with abnormal operation exists in the CXL memory device group, querying the device node adjacency table by the control device, and determining the CXL host connected with the abnormal memory device in the CXL host group as the target host.
12. The method of claim 10, characterized in that,
The target virtual machine manager stores a target information table, wherein the target information table is used for recording the CXL memory devices connected with the target host and the virtual machine systems to which those CXL memory devices are allocated;
After the sending, by the control device, a first notification message to the target virtual machine manager, the method further comprises:
Responding to the received first notification message, and inquiring the target information table through the target virtual machine manager to determine a target virtual machine system allocated by the abnormal memory device;
and executing memory migration operation for the target virtual machine system through the target virtual machine manager so as to migrate the memory data in the abnormal memory device into the target memory device, and distributing the target memory device to the target virtual machine system.
13. The method of claim 12, characterized in that,
Before the memory migration operation is performed for the target virtual machine system by the target virtual machine manager, the method further includes:
Querying the target information table through the target virtual machine manager to find whether CXL memory equipment meeting memory data migration conditions exists in CXL memory equipment connected with the target host;
under the condition that CXL memory equipment meeting the memory data migration condition is found, determining the found CXL memory equipment as the target memory equipment;
if no CXL memory device meeting the memory data migration condition is found, sending a device application request message to the control device through the target virtual machine manager, wherein the device application request message is used for applying for attaching a CXL memory device meeting the memory data migration condition to the target host;
Upon receiving a device application response message returned by the control device in response to the device application request message, determining the CXL memory device indicated by the device application response message as the target memory device;
the memory data migration condition is that the device has memory in the idle state into which the memory data in the abnormal memory device can be written.
14. The method of claim 12, characterized in that,
The executing, by the target virtual machine manager, a memory migration operation for the target virtual machine system, including:
Migrating data in a first memory page of the abnormal memory device to a second memory page of the target memory device through the target virtual machine manager, and updating a first conversion record corresponding to the abnormal memory device in a target expansion page table of the target virtual machine system to a second conversion record so as to complete memory migration of the target virtual machine system, wherein the first conversion record is used for indicating a corresponding relation between a memory page address of the first memory page and a physical memory address of the first memory page, and the second conversion record is used for indicating a corresponding relation between a memory page address of the second memory page and a physical memory address of the second memory page.
15. The method of claim 14, characterized in that,
After the migrating, by the target virtual machine manager, the data in the first memory page of the abnormal memory device to the second memory page of the target memory device, the method further comprises:
Acquiring a memory page update record through the target virtual machine manager, wherein the memory page update record is used for recording the update of the data in the memory page of the abnormal memory device in the process of migrating the data in the first memory page to the second memory page;
and updating the data in the memory page of the target memory device according to the memory page update record.
16. The method of claim 14, characterized in that,
Before updating the first conversion record corresponding to the abnormal memory device in the target expansion page table of the target virtual machine system to the second conversion record, the method further includes:
Adding a third conversion record in a memory conversion relation table through the target virtual machine manager, and deleting the first conversion record from the target expansion page table, wherein the memory conversion relation table is used for recording the correspondence between the memory page address of a memory page from which data of a CXL memory device in an abnormal state has been migrated and the memory page address of the memory page to which that data has been migrated;
And in response to an acquired target page fault exception, traversing the memory conversion relation table by the target virtual machine manager based on the memory page address of the first memory page to obtain the third conversion record, wherein the target page fault exception is triggered by an access operation to the memory page address of the first memory page, and the target expansion page table is updated based on the memory page address of the second memory page in the third conversion record.
17. The method of claim 1, characterized in that,
After the control performs a migration operation on the memory data in the abnormal memory device, the method further includes:
Displaying abnormal alarm information on a designated display interface, wherein the abnormal alarm information is used for indicating the equipment position of the abnormal memory equipment and the abnormal reason of the abnormal memory equipment;
and sending a second notification message to the abnormal memory device, wherein the second notification message is used for notifying the abnormal memory device to alarm.
18. The method according to any one of claims 1 to 17, wherein,
After predicting the operation states of the CXL memory devices in the group of CXL memory devices based on the obtained parameter values for the set of operation parameters, the method further includes:
and under the condition that the CXL memory devices in the CXL memory device group are predicted to normally operate, saving the parameter values of the group of operation parameters.
19. A server system, characterized in that,
Comprising: a control device, a CXL memory device group, a CXL exchange unit and a CXL host unit, wherein the CXL memory device group is connected with the CXL exchange unit through a CXL bus, and the CXL exchange unit is used for connecting the CXL memory device group with the CXL host unit, wherein,
The control device is configured to obtain parameter values of a set of operation parameters of CXL memory devices in the CXL memory device group, where all CXL memory devices in the CXL memory device group are in an operation state, and the set of operation parameters is used to represent an operation state of the corresponding CXL memory device; predicting the operation state of CXL memory devices in the CXL memory device group according to the acquired parameter values of the group of operation parameters; and under the condition that the abnormal memory device with abnormal operation in the CXL memory device group is predicted, controlling to execute migration operation on the memory data in the abnormal memory device so as to migrate the memory data in the abnormal memory device to a target memory device with normal operation state in the CXL memory device group, wherein the abnormal memory device after data migration is removed from the CXL memory device group.
20. A computer program product, characterized in that,
The computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 18.
21. A computer-readable storage medium, characterized in that,
The computer readable storage medium having stored therein a computer program, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1 to 18.
22. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
The processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 18.
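The device node adjacency table recited in claim 11 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the tuple layout of the port information, the `"host"` and `"memory"` type labels, and the function names are hypothetical, and the sketch assumes that every host attached to a switch is connected to every memory device on the same switch, which ignores any port binding a real CXL switch may apply.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (switch id, port number, connected device id, connected device type)
PortInfo = Tuple[str, int, str, str]


def build_adjacency(switch_ports: Iterable[PortInfo]) -> Dict[str, Set[str]]:
    """Map each memory device to the hosts reachable through the same switch."""
    by_switch: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for switch_id, _port, device_id, device_type in switch_ports:
        by_switch[switch_id][device_type].add(device_id)

    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for groups in by_switch.values():
        for mem in groups["memory"]:
            adjacency[mem] |= groups["host"]
    return dict(adjacency)


def find_target_hosts(adjacency: Dict[str, Set[str]], abnormal_id: str) -> Set[str]:
    """Query the table for the hosts connected to the abnormal memory device."""
    return adjacency.get(abnormal_id, set())


ports = [
    ("sw0", 0, "host0", "host"),
    ("sw0", 1, "mem0", "memory"),
    ("sw0", 2, "mem1", "memory"),
    ("sw1", 0, "host1", "host"),
    ("sw1", 1, "mem2", "memory"),
]
table = build_adjacency(ports)
print(find_target_hosts(table, "mem2"))  # {'host1'}
```

Building the table once from the switch port information lets the control device resolve the target host with a single lookup when an abnormal memory device is predicted.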
CN202410532219.XA 2024-04-29 2024-04-29 CXL memory fault tolerance method, server system, storage medium and electronic equipment Pending CN118132350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410532219.XA CN118132350A (en) 2024-04-29 2024-04-29 CXL memory fault tolerance method, server system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410532219.XA CN118132350A (en) 2024-04-29 2024-04-29 CXL memory fault tolerance method, server system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN118132350A true CN118132350A (en) 2024-06-04

Family

ID=91242875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410532219.XA Pending CN118132350A (en) 2024-04-29 2024-04-29 CXL memory fault tolerance method, server system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118132350A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080104349A1 (en) * 2006-10-25 2008-05-01 Tetsuya Maruyama Computer system, data migration method and storage management server
CN103324582A (en) * 2013-06-17 2013-09-25 华为技术有限公司 Memory migration method, memory migration device and equipment
CN108205424A (en) * 2017-12-29 2018-06-26 北京奇虎科技有限公司 Data migration method, device and electronic equipment based on disk
CN115686909A (en) * 2022-10-31 2023-02-03 苏州浪潮智能科技有限公司 Memory fault prediction method and device, storage medium and electronic device
CN117668706A (en) * 2023-12-13 2024-03-08 中国建设银行股份有限公司 Method and device for isolating memory faults of server, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US8191069B2 (en) Method of monitoring performance of virtual computer and apparatus using the method
US8396917B2 (en) Storage management system, storage hierarchy management method, and management server capable of rearranging storage units at appropriate time
US9348724B2 (en) Method and apparatus for maintaining a workload service level on a converged platform
WO2017140131A1 (en) Data writing and reading method and apparatus, and cloud storage system
US7392433B2 (en) Method and system for deciding when to checkpoint an application based on risk analysis
CN108923992B (en) High-availability method and system for NAS cluster, electronic equipment and storage medium
CN108628717A (en) A kind of Database Systems and monitoring method
CN112199240B (en) Method for switching nodes during node failure and related equipment
CN105786405A (en) Online upgrading method, device and system
US20140201356A1 (en) Monitoring system of managing cloud-based hosts and monitoring method using for the same
CN110780811B (en) Data protection method, device and storage medium
CN107864055A (en) The management method and platform of virtualization system
WO2018171728A1 (en) Server, storage system and related method
CN114979158A (en) Resource monitoring method, system, equipment and computer readable storage medium
CN109302445A (en) Host node state determines method, apparatus, host node and storage medium
CN112256433A (en) Partition migration method and device based on Kafka cluster
CN110888769B (en) Data processing method and computer equipment
CN118132350A (en) CXL memory fault tolerance method, server system, storage medium and electronic equipment
CN109840051B (en) Data storage method and device of storage system
CN115878052A (en) RAID array inspection method, inspection device and electronic equipment
CN107147516B (en) Server, storage system and related method
CN115150253B (en) Fault root cause determining method and device and electronic equipment
US8074109B1 (en) Third-party voting to select a master processor within a multi-processor computer
CN115904773A (en) Memory fault information collection method and device and storage medium
CN112187919B (en) Storage node management method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination