CN115617581A - Memory fault processing method and device

Info

Publication number: CN115617581A
Application number: CN202211166819.6A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 李钟�, 周栋树, 楼佳
Current Assignee: XFusion Digital Technologies Co Ltd
Original Assignee: XFusion Digital Technologies Co Ltd
Application filed by XFusion Digital Technologies Co Ltd
Priority date: 2019-09-30
Publication date: 2023-01-17
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2094Redundant storage or storage space

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

A memory fault handling method and apparatus are applied to a device that includes a first storage controller and a second storage controller, where the first storage controller manages a first storage and the second storage controller manages a second storage. The storage space formed by the second storage managed by the second storage controller serves as a backup space for the storage space formed by the first storage managed by the first storage controller. The method comprises: when a failure is detected in the first storage managed by the first storage controller, transferring the data read/write operations of the device from the storage space formed by the first storage managed by the first storage controller to the storage space formed by the second storage managed by the second storage controller. The method is suitable for scenarios in which a DCPMM serves as the first storage and ensures the reliability of the data in the DCPMM.

Description

Memory fault processing method and device
This application is a divisional application of the original application with application number 201910945454.9, filed on September 30, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of storage technologies, and in particular, to a method and an apparatus for processing a memory fault.
Background
Memory is an important component of a device: it stores the data required for operation by the device's processor and exchanges data with external storage in the device, such as a hard disk.
During operation, the processor first loads the data it needs into memory and then fetches that data from memory to carry out its computations; if the data in memory is corrupted, the operation of the whole device may be affected.
Conventional memory is volatile, meaning that when the device loses power, the data in memory is lost and the device cannot operate normally. To ensure that data in memory survives a power loss, persistent memory has been developed, such as the data center persistent memory module (DCPMM), also referred to in the industry as DC persistent memory, which offers larger capacity and a read/write speed close to that of volatile memory.
Although persistent memory can retain data for a long time, it still faces data reliability problems. For example, when a persistent memory module fails, the processor cannot obtain data from it for its computations, and the device cannot operate normally.
In summary, a method that effectively ensures data reliability and is suitable for persistent memory is needed.
Disclosure of Invention
The present application provides a memory fault handling method and apparatus that ensure the reliability of data in a DCPMM.
In a first aspect, the present application provides a memory fault handling method applied to a device that includes a first storage controller and a second storage controller, where the first storage controller manages a first storage and the second storage controller manages a second storage; the number of first storages and second storages is not limited and may be one or more. The storage space formed by the second storage managed by the second storage controller is a backup space for the storage space formed by the first storage managed by the first storage controller. The method comprises: when a failure is detected in the first storage managed by the first storage controller, transferring the data read/write operations of the device from the storage space formed by the first storage managed by the first storage controller to the storage space formed by the second storage managed by the second storage controller.
With this method, the storage spaces formed by the storages managed by the two storage controllers (the first storage controller and the second storage controller) serve as the primary and backup storage spaces, at the granularity of a storage controller, which suits scenarios in which a DCPMM is used as a storage (for example, the first storage or the second storage). In addition, when a storage managed by one of the controllers fails, the data read/write operations of the device are switched to the storage space managed by the other controller, which ensures the reliability of the data.
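As a minimal illustration of this primary/backup switch (not part of the patent text; all type and function names below are hypothetical), the following C sketch routes device reads through whichever storage space is currently active and fails over to the backup space once the primary is marked faulty:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one storage space (primary or backup). */
struct storage_space {
    uint8_t *base;    /* start of the mapped space                  */
    size_t   size;    /* usable capacity in bytes                   */
    bool     faulty;  /* set when a storage managed here has failed */
};

/* The pair of spaces managed by the two storage controllers. */
struct mirror_pair {
    struct storage_space *primary;
    struct storage_space *backup;
    struct storage_space *active;  /* where reads/writes currently go */
};

/* Transfer the device's read/write path to the backup space. */
static void failover(struct mirror_pair *p)
{
    if (p->active == p->primary && p->primary->faulty)
        p->active = p->backup;
}

/* All accesses go through the currently active space. */
static int space_read(struct mirror_pair *p, size_t off, void *buf, size_t len)
{
    if (p->primary->faulty)
        failover(p);               /* fault found: switch to the backup */
    if (off + len > p->active->size)
        return -1;
    memcpy(buf, p->active->base + off, len);
    return 0;
}
```

A real implementation would also mirror writes and track per-module fault state; the sketch only shows the switching decision.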
In a possible implementation, the device may further write back the faulty data in the first storage managed by the first storage controller using the data in the second storage managed by the second storage controller, thereby recovering the data of the failed first storage.
If the write-back succeeds, the data read/write operations of the device can be switched back to the storage space formed by the first storage managed by the first storage controller the next time the device is powered on.
In this way, the data of the failed first storage can be effectively recovered, the storage space formed by the first storage managed by the first storage controller returns to normal, and the device can continue reading and writing data in that storage space.
In a possible implementation, if the write-back fails, the device may issue an alarm message indicating that the first storage has failed.
In this way, if the data of the failed first storage cannot be recovered, the user is notified promptly, so that the user can discover the fault of the first storage in the device in time and replace the failed first storage.
In one possible implementation, both the first storage controller and the second storage controller manage DCPMMs. That is, the first storage and the second storage are both DCPMMs, and the first storage controller and the second storage controller are memory controllers, for example integrated memory controllers (IMCs).
With this approach, the primary and backup storage spaces are formed by the DCPMMs managed by the memory controllers, at the granularity of a memory controller, which better ensures the reliability of data in the DCPMMs and accommodates different DCPMM population schemes. Because DCPMMs offer good read/write speed, the overall data access efficiency of the device remains high in this case, and the overall performance of the device is not affected.
In one possible implementation, the first storage controller manages DCPMMs and the second storage controller manages a plurality of disks; the first storage controller is a memory controller and the second storage controller is a RAID card.
With this approach, the reliability of data in the DCPMMs managed by the memory controller is ensured; in this case the storage space formed by multiple disks serves as the backup space instead of the storage space formed by part of the DCPMMs managed by a memory controller, which effectively saves cost.
In a possible implementation, the device may map the storage spaces of the plurality of first storages managed by the first storage controller to a logical space of the first storage controller and establish a mapping between the addresses of those storage spaces and the addresses of the logical space. When storing data in the first storages, the first storage controller then allocates addresses from the logical space, which distributes the data more evenly across the individual first storages and thereby balances the data among the plurality of first storages it manages.
In one possible implementation, fault detection may be performed on the first storage managed by the first storage controller before it fails. A default function can be configured in the device and enabled at initial start-up; this function includes fault detection and further specifies that, when a fault is detected in the first storage, data write-back is performed on the first storage using the data in the backup space.
By enabling the fault detection function in advance, a failed first storage can be found in time, and data can be written back promptly when the fault occurs, which ensures both the reliability of the data and the normal operation of the device.
In a possible implementation, before a fault occurs in the first storage managed by the first storage controller, the user may be prompted at device start-up to select a mode for the primary/backup storage spaces, for example a first mode or a second mode. The first mode may be a performance mode that preserves read/write efficiency (corresponding to the case where the first storage and the second storage are both DCPMMs and both storage controllers are memory controllers), and the second mode may be a cost mode that saves cost (corresponding to the case where the first storage is a DCPMM, the first storage controller is a memory controller, the second storage is a set of disks, and the second storage controller is a RAID card). The user selects the desired mode and triggers an instruction indicating that the first mode or the second mode is to be configured. After receiving the user-triggered instruction, the device configures the first mode or the second mode in response to it. When configuring either mode, the device may perform address mapping between the storage space formed by the first storage managed by the first storage controller and the storage space formed by the second storage managed by the second storage controller, configure the former as the primary storage space, and configure the latter as the backup storage space.
In this way, the primary and backup storage spaces in different modes can be configured flexibly according to the user-triggered instruction, giving the user more choices and a better experience, and the storage space formed by the first storage managed by the first storage controller is associated with the storage space formed by the second storage managed by the second storage controller through address mapping.
In a possible implementation, when performing address mapping between the storage space formed by the first storage managed by the first storage controller and the storage space formed by the second storage managed by the second storage controller, the former may be treated as a block and divided into a plurality of sub-blocks, the latter may likewise be treated as a block and divided into a plurality of sub-blocks, and a one-to-one mapping may be established between the addresses of the sub-blocks of the two storage spaces.
Performing the address mapping between the two storage spaces by dividing them into sub-blocks in this way is simple and efficient.
In a second aspect, an embodiment of the present application further provides a fault handling apparatus; for its beneficial effects, refer to the description of the first aspect, which is not repeated here. The apparatus has the functionality to implement the actions in the method examples of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions above. In a possible design, the fault handling apparatus includes a detection unit and a switching unit, and may further include a write-back unit and a configuration unit; these units can perform the corresponding functions in the method examples of the first aspect, for which refer to the detailed description in the method examples, not repeated here.
In a fourth aspect, an embodiment of the present application further provides a device; for its beneficial effects, refer to the description of the first aspect, which is not repeated here. The device comprises a processing unit, a first memory, and a second memory, where the processing unit is configured to support the device in performing the corresponding functions of the method of the first aspect. The first memory and the second memory are coupled to the processing unit, and the first memory or the second memory holds the program instructions and data necessary for the device. The device further comprises a communication interface for communicating with other devices.
In a fifth aspect, the present application also provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
In a sixth aspect, the present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspects.
In a seventh aspect, the present application further provides a computer chip, where the chip is connected to a memory and is configured to read and execute a software program stored in the memory to perform the methods of the above aspects.
Drawings
FIG. 1 is a schematic diagram of a memory management system architecture provided in the present application;
FIG. 2 is a schematic diagram of another memory management system architecture provided in the present application;
FIG. 3 is a schematic diagram of a configuration method of a memory management system provided in the present application;
FIG. 4 is a schematic diagram of a memory fault handling method provided in the present application;
FIG. 5 is a schematic diagram of another memory fault handling method provided in the present application;
FIG. 6 is a schematic structural diagram of a fault handling apparatus provided in the present application;
FIG. 7 is a schematic structural diagram of a device provided in the present application.
Detailed Description
An embodiment of the present application provides a memory fault handling method in which mirroring is configured at the granularity of the storage space managed by a storage controller: the storage space formed by the first storage managed by the first storage controller serves as the primary storage space, and the storage space formed by the second storage managed by the second storage controller serves as the backup storage space. When the first storage managed by the first storage controller fails, the device switches the roles of the primary and backup storage spaces, redirects its data read/write operations to the backup storage space, and writes data back to the failed first storage using the data stored in the backup storage space. Because the primary and backup storage spaces are defined at the granularity of the storage spaces of the storages (the first storage and the second storage) managed by the storage controllers (the first storage controller and the second storage controller), the scheme is suitable for the case where a DCPMM serves as the storage (for example, the first storage or the second storage), ensures the data reliability of the DCPMM, accommodates different population schemes of multiple DCPMMs in the device, and effectively broadens the range of application.
In this embodiment, the storages (such as the first storage or the second storage) managed by the two storage controllers may be of the same type or of different types. For example, both storage controllers may manage multiple DCPMMs; or one storage controller manages multiple DCPMMs while the other manages a redundant array of independent disks (RAID) composed of multiple disks. Two memory management systems in a device to which the present application is applicable are described below. In this embodiment, the first storage or the second storage may also be a type of storage other than a DCPMM or a disk; DCPMM and disk are used here only as examples.
In the first system, the two storage controllers each manage multiple DCPMMs and are both memory controllers: they can, for example, read data from the DCPMMs, write data to them, and cache data for them. A memory controller may be a separate module in the device or may be integrated into a processor of the device, such as a central processing unit (CPU). Illustratively, the memory controller may be an integrated memory controller (IMC). The IMC is integrated in the processor and implements the management of the DCPMMs, including DCPMM initialization, data reads and writes, and caching operations.
This embodiment uses an IMC only as an example of a memory controller; the specific form of the memory controller is not limited, it need not be integrated in a processor, and any controller capable of managing DCPMMs can serve as the memory controller in the present application.
FIG. 1 is a schematic diagram of a memory management system architecture provided in an embodiment of the present application. The memory management system is located in a device and includes a CPU 100; the steps of the method embodiments provided in this application may be, but are not limited to being, executed by the CPU 100.
The CPU 100 includes two IMCs, IMC 110 and IMC 120. Each IMC manages three channels: IMC 110 manages channel 111, channel 112, and channel 113, and IMC 120 manages channel 121, channel 122, and channel 123. Each channel supports at most two memory slots, and each memory slot can hold either a DCPMM or a dynamic random access memory (DRAM) module.
The DCPMM is a special type of memory that uses the dual inline memory module (DIMM) packaging interface and can act as either non-volatile or volatile memory depending on its mode. The DCPMM has three modes: Memory Mode (MM), App Direct mode (AD), and mixed mode (MIX). In Memory Mode the DCPMM serves as volatile memory; in App Direct mode it serves as non-volatile memory, so data is not lost on power-down; in mixed mode part of its storage space serves as non-volatile memory and part as volatile memory.
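The following C sketch (an assumption-laden illustration, not code from the patent) models the three modes and how a module's capacity could be split between volatile and persistent use; the 50/50 split for mixed mode is an arbitrary illustrative choice:

```c
#include <stddef.h>

/* Hypothetical model of the three DCPMM operating modes described above. */
enum dcpmm_mode {
    DCPMM_MM,   /* Memory Mode: behaves as volatile memory          */
    DCPMM_AD,   /* App Direct:  behaves as non-volatile memory      */
    DCPMM_MIX,  /* Mixed:       capacity split between the two uses */
};

struct dcpmm {
    enum dcpmm_mode mode;
    size_t capacity;        /* total capacity in bytes          */
    size_t volatile_part;   /* bytes exposed as volatile memory */
    size_t persistent_part; /* bytes exposed as persistent memory */
};

/* Split a module's capacity according to its mode; the 50/50 split in
 * MIX mode is an illustrative choice, not a rule from the patent. */
static void dcpmm_partition(struct dcpmm *m)
{
    switch (m->mode) {
    case DCPMM_MM:
        m->volatile_part = m->capacity;
        m->persistent_part = 0;
        break;
    case DCPMM_AD:
        m->volatile_part = 0;
        m->persistent_part = m->capacity;
        break;
    case DCPMM_MIX:
        m->volatile_part = m->capacity / 2;
        m->persistent_part = m->capacity - m->volatile_part;
        break;
    }
}
```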
A DCPMM may or may not be installed in the memory slots under each channel of an IMC. Because of the usage constraints of the DCPMM, DRAM must be paired with the DCPMM when populating the device's memory slots, and current DCPMMs support a total of 15 population schemes depending on capacity usage. For example, one scheme installs a DCPMM in every DIMM slot under the IMC; another installs only one DIMM per channel under the IMC.
Under this memory management system architecture, one of the two storage spaces formed by the DCPMMs managed by IMC 110 and by IMC 120 (referred to as the storage space managed by IMC 110 or IMC 120, respectively) is configured as the primary storage space, and the storage space managed by the other IMC is configured as the backup storage space. In other words, one IMC is configured as the primary IMC and the other as the backup IMC; the data read/write operations of the device take place in the primary storage space, while a backup of the data in the primary storage space is kept in the backup storage space.
As a possible implementation, to distribute the data of the primary and backup storage spaces more evenly across the DCPMMs under the channels, data balancing may be performed on the storage space managed by each IMC.
Illustratively, for the storage space managed by an IMC, part or all of the storage space of the DCPMM under each channel of that IMC is mapped into a logical space managed by the IMC. When data is stored in the storage space managed by the IMC, the IMC does not place the data in a single DCPMM; instead it allocates a storage address from the logical space, so the data is distributed evenly across the DCPMMs under the channels.
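One plausible way such a logical space could spread data across the DCPMMs under an IMC's three channels is simple block interleaving; the following C sketch is hypothetical and uses an illustrative 4 KiB granularity:

```c
#include <stdint.h>

#define CHANNELS_PER_IMC 3      /* per the architecture in FIG. 1        */
#define INTERLEAVE_BYTES 4096   /* illustrative interleaving granularity */

struct phys_addr {
    unsigned channel;   /* which channel's DCPMM holds the data */
    uint64_t offset;    /* byte offset within that DCPMM region */
};

/* Translate an address in the IMC's logical space to a physical location.
 * Consecutive 4 KiB blocks rotate across the three channels, which spreads
 * data evenly over the DCPMMs the IMC manages. */
static struct phys_addr logical_to_physical(uint64_t logical)
{
    uint64_t block = logical / INTERLEAVE_BYTES;
    struct phys_addr pa = {
        .channel = (unsigned)(block % CHANNELS_PER_IMC),
        .offset  = (block / CHANNELS_PER_IMC) * INTERLEAVE_BYTES
                   + logical % INTERLEAVE_BYTES,
    };
    return pa;
}
```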
In the second system, one storage controller, an IMC, manages multiple DCPMMs, and the other storage controller, a RAID card, manages a RAID.
FIG. 2 is a schematic diagram of another memory management system architecture according to an embodiment of the present application. The memory management system is located in a device and includes a CPU 100, which may contain two IMCs, IMC 110 and IMC 120. The structures of IMC 110 and IMC 120 are similar to those in the system architecture shown in FIG. 1; refer to the foregoing description for details, which are not repeated here.
This architecture also includes a RAID card 200, which implements the RAID functions and configures a redundancy policy for a RAID 210 composed of multiple disks 211. In this embodiment, RAID 210 may serve as the mirror unit of the storage space managed by IMC 110 and/or IMC 120, that is, RAID 210 may store the data backup of that storage space. The redundancy policy configured by the RAID card 200 for RAID 210 may be RAID 1, which writes data to, and allows data to be read from, two disks 211 at the same time and keeps the data on the two disks 211 consistent. The RAID card 200 may also configure other redundancy policies for RAID 210; RAID 1 is merely an example, and the embodiment does not limit the redundancy policy configured for the RAID.
Under this architecture, the storage space composed of the multiple DCPMMs managed by IMC 110 and/or IMC 120 (referred to as the storage space managed by IMC 110 and/or IMC 120 for short) is configured as the primary storage space, and RAID 210 is the backup storage space. The data read/write operations of the device take place in the primary storage space, while a backup of the data in the primary storage space is kept in the backup storage space.
Under this architecture, data balancing can also be performed on the storage space managed by each IMC, in the same way as in the first memory management system architecture; refer to the foregoing description for details, which are not repeated here.
The first memory management system uses the storage space formed by half of the DCPMMs in the whole system as the backup storage space, with one of the two IMCs acting as primary and the other as backup. Because the device still reads and writes data from DCPMMs, the data read/write efficiency of the whole memory management system is unaffected and its read/write performance is effectively preserved.
The second memory management system uses the RAID as the mirror unit, so no storage space formed by part of the DCPMMs in the system needs to be set aside as backup space, which effectively saves cost.
The embodiment of the present application provides these two memory management systems; at initial start-up the device's memory management system can be configured as either of the two, based on a user trigger. A configuration method for the memory management system is described below.
As shown in FIG. 3, a configuration method for a memory management system according to an embodiment of the present application includes:
Step 301: at initial boot of the device, the BIOS configures the mode of each DCPMM, for example selecting the AD, MIX, or MM mode.
Step 302: after the IMC-DCPMM mirror patrol write-back function is enabled, the user is prompted to select the performance mode (corresponding to the first memory management system) or the cost mode (corresponding to the second memory management system).
The IMC-DCPMM mirror patrol write-back function means that, after the device starts, fault detection is performed on the DCPMMs managed by IMC 110, and when a DCPMM managed by IMC 110 is detected to have failed, data is written back to the failed DCPMM using the data in the backup storage space.
It should be noted that the DCPMMs detected by the IMC-DCPMM mirror patrol write-back function are those managed by the primary IMC, that is, the DCPMMs forming the primary storage space. This embodiment takes the storage space managed by IMC 110 as the primary storage space; when the storage space managed by IMC 120 is the primary storage space, the function detects the DCPMMs managed by IMC 120, and when the storage spaces managed by both IMC 110 and IMC 120 form the primary storage space, it detects the DCPMMs managed by both IMC 110 and IMC 120.
If the user selects the performance mode:
Step 303: for IMC 110 and IMC 120 in CPU 100, the BIOS performs address mapping between the storage space managed by IMC 110 and the storage space managed by IMC 120.
When mapping addresses between the storage space formed by the DCPMMs managed by IMC 110 and the storage space formed by the DCPMMs managed by IMC 120, the storage space managed by IMC 110 may be treated as one block, which is then divided into sub-blocks, with each sub-block assigned an address.
Similarly, the storage space managed by IMC 120 is treated as a block, divided into sub-blocks, and each sub-block is assigned an address; a mapping is then established between the address of each sub-block under IMC 110 and the address of the corresponding sub-block under IMC 120.
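A rough sketch of building this one-to-one sub-block mapping is shown below; the code is hypothetical and the 64 MiB sub-block size is an illustrative assumption, not a value given in the patent:

```c
#include <stdint.h>
#include <stdlib.h>

#define SUBBLOCK_BYTES (64u * 1024u * 1024u)   /* illustrative sub-block size */

/* One entry pairs a sub-block of the primary block (IMC 110's space)
 * with the corresponding sub-block of the backup block (IMC 120's space). */
struct subblock_map_entry {
    uint64_t primary_addr;   /* start address of the sub-block in the primary space  */
    uint64_t backup_addr;    /* start address of the mirrored sub-block in the backup */
};

/* Build the one-to-one mapping table for two equally sized blocks. */
static struct subblock_map_entry *
build_subblock_map(uint64_t primary_base, uint64_t backup_base,
                   uint64_t block_size, uint64_t *count)
{
    *count = block_size / SUBBLOCK_BYTES;
    struct subblock_map_entry *map = calloc(*count, sizeof(*map));
    if (!map)
        return NULL;
    for (uint64_t i = 0; i < *count; i++) {
        map[i].primary_addr = primary_base + i * SUBBLOCK_BYTES;
        map[i].backup_addr  = backup_base  + i * SUBBLOCK_BYTES;
    }
    return map;
}
```

Each entry simply pairs the i-th sub-block of the primary block with the i-th sub-block of the backup block, which is what makes later mirroring and write-back straightforward.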
Step 304: the storage space managed by IMC 110 is configured as the primary storage space and the storage space managed by IMC 120 as the backup storage space.
Step 305: if the storage space managed by IMC 110 is successfully configured as the primary storage space and the storage space managed by IMC 120 as the backup storage space, the device is restarted.
If the storage space managed by IMC 110 is not successfully configured as the primary storage space or the storage space managed by IMC 120 is not successfully configured as the backup storage space, the user is prompted that the configuration failed.
If the user selects the cost mode:
Step 306: the size of the non-volatile memory in the storage space managed by IMC 110 is determined.
The BIOS may check the capacity of the storage medium (media) in each DCPMM by accessing the non-volatile memory controller (NVM controller) in each DCPMM, and from this determine the size of the non-volatile memory in the storage space managed by IMC 110.
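A minimal sketch of this capacity check, assuming a hypothetical query_nvm_controller() hook in place of whatever management interface the BIOS actually uses:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-module view returned by querying a DCPMM's NVM controller. */
struct dcpmm_info {
    uint64_t persistent_bytes;   /* capacity exposed as non-volatile memory */
    int      reachable;          /* non-zero if the NVM controller answered */
};

/* Assumed query hook; a real BIOS would read this over the module's
 * management interface. */
extern struct dcpmm_info query_nvm_controller(unsigned channel, unsigned slot);

/* Sum the non-volatile capacity of all DCPMMs under one IMC (3 channels,
 * up to 2 slots each, per the architecture in FIG. 1). */
static uint64_t imc_nonvolatile_capacity(void)
{
    uint64_t total = 0;
    for (unsigned ch = 0; ch < 3; ch++) {
        for (unsigned slot = 0; slot < 2; slot++) {
            struct dcpmm_info info = query_nvm_controller(ch, slot);
            if (info.reachable)
                total += info.persistent_bytes;
        }
    }
    return total;
}
```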
Step 307: the RAID card 200 configures a redundancy policy for RAID 210, for example RAID 1.
Step 308: address mapping is performed between the non-volatile memory in the storage space managed by IMC 110 and RAID 210 in the device.
In this embodiment, to ensure the reliability of data in the non-volatile memory of the storage space managed by IMC 110, only that non-volatile memory may be address-mapped to RAID 210. Alternatively, the entire storage space managed by IMC 110 may be address-mapped to RAID 210, which ensures the reliability of data in both the non-volatile memory and the volatile memory of that storage space. This embodiment describes only the address mapping between the non-volatile memory of the storage space managed by IMC 110 and RAID 210; mapping the entire storage space managed by IMC 110 to RAID 210 is handled similarly and is not described again here.
It should be noted that this embodiment takes as an example the case where the non-volatile memory of the storage space managed by IMC 110 is configured as the primary storage space, so in step 308 the address mapping is performed between that non-volatile memory and RAID 210.
If the non-volatile memory of the storage space managed by IMC 120 is configured as the primary storage space, the address mapping is performed between the non-volatile memory in the storage space managed by IMC 120 and RAID 210; if the non-volatile memories of the storage spaces managed by both IMC 110 and IMC 120 are configured as the primary storage space, the address mapping is performed between those non-volatile memories and RAID 210; the choice can be made according to the actual scenario. These two cases are similar to the case where the non-volatile memory of the storage space managed by IMC 110 is the primary storage space and are not described again here.
When mapping addresses between the non-volatile memory in the storage space managed by IMC 110 and RAID 210, the non-volatile memory in the storage space managed by IMC 110 may be treated as one block, which is then divided into sub-blocks, with each sub-block assigned an address.
Similarly, RAID 210 is treated as one block, divided into sub-blocks, and each sub-block is assigned an address; a mapping is then established between the address of each sub-block under IMC 110 and the address of the corresponding sub-block of RAID 210.
Step 309: the non-volatile memory in the storage space managed by IMC 110 is configured as the primary storage space and RAID 210 as the backup storage space.
Step 310: if the non-volatile memory in the storage space managed by IMC 110 is successfully configured as the primary storage space, RAID 210 is configured as the backup storage space; otherwise, the user is prompted that the configuration failed.
After the memory management system has been configured, whenever the device performs a write in the primary storage space, the same write can be performed in the backup storage space, so that the backup storage space holds a data backup of the primary storage space (a sketch of such mirrored writes is given below). In addition, when a DCPMM managed by IMC 110 fails, a corresponding DCPMM fault handling method can be applied; the DCPMM fault handling methods under the two different memory management systems are described below:
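A minimal sketch of such a mirrored write path, assuming the sub-block mapping built earlier and hypothetical helper functions for the raw device accesses:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed helpers: resolve a primary-space address to its mirrored address
 * via the sub-block mapping, and perform the raw writes to each space. */
extern uint64_t map_to_backup_addr(uint64_t primary_addr);
extern int write_primary(uint64_t addr, const void *buf, size_t len);
extern int write_backup(uint64_t addr, const void *buf, size_t len);

/* Every write to the primary storage space is repeated in the backup
 * storage space, so the backup always holds a copy of the primary's data. */
static int mirrored_write(uint64_t primary_addr, const void *buf, size_t len)
{
    int rc = write_primary(primary_addr, buf, len);
    if (rc != 0)
        return rc;
    return write_backup(map_to_backup_addr(primary_addr), buf, len);
}
```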
First, the DCPMM fault handling method based on the memory management system architecture shown in FIG. 1.
As shown in FIG. 4, taking IMC 110 as primary and IMC 120 as backup, a DCPMM fault handling method provided in an embodiment of the present application includes:
Step 401: during device operation, each DCPMM managed by IMC 110 in the device is checked for faults.
A DCPMM includes a non-volatile memory controller (NVM controller) and a storage medium (media). The storage medium in the DCPMM stores the data. The NVM controller actively monitors the storage medium, determines its status (for example its temperature and whether it has uncorrectable errors), and records that status in fault monitoring information.
In the embodiments of the present application, whether a DCPMM has failed can be determined by accessing its NVM controller.
For any DCPMM managed by IMC 110, if its NVM controller cannot be accessed, the NVM controller has failed and the DCPMM is considered failed.
If the NVM controller can be accessed, the fault monitoring information it records is read to determine the state of the storage medium; for example, if the storage medium shows an irreversible condition such as over-temperature, an uncorrectable error, or electrical damage, the storage medium is judged to have failed and the DCPMM is therefore judged to have failed.
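The following C sketch summarizes this decision logic; the status structure and accessor are hypothetical placeholders for whatever the NVM controller actually exposes:

```c
#include <stdbool.h>

/* Hypothetical snapshot of the fault monitoring information an NVM
 * controller keeps about its storage medium. */
struct nvm_status {
    bool over_temperature;
    bool uncorrectable_error;
    bool electrical_damage;
};

/* Assumed accessor: returns false if the NVM controller cannot be reached. */
extern bool read_nvm_status(unsigned dcpmm_id, struct nvm_status *out);

/* A DCPMM is treated as failed if its NVM controller is unreachable or the
 * monitoring information shows an irreversible condition of the medium. */
static bool dcpmm_has_failed(unsigned dcpmm_id)
{
    struct nvm_status st;

    if (!read_nvm_status(dcpmm_id, &st))
        return true;   /* the NVM controller itself has failed */

    return st.over_temperature || st.uncorrectable_error || st.electrical_damage;
}
```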
Step 402: when a DCPMM managed by IMC 110 is detected to have failed, the data read/write operations of the device are switched to the storage space managed by IMC 120; that is, IMC 110 is demoted to backup and IMC 120 is promoted to primary.
Step 403: data is written back to the failed DCPMM using the data in the storage space managed by IMC 120.
Step 404: if the write-back succeeds, the primary/backup relationship between IMC 120 and IMC 110 is restored the next time the device starts, with IMC 110 as primary and IMC 120 as backup.
Step 405: if the write-back fails, an alarm message is sent to the user indicating that the DCPMM has failed, and the user can be instructed to replace the failed DCPMM.
The user powers off the device and replaces the failed DCPMM. After power-on, each component in the device (such as the processor and the memory) is initialized; once initialization is complete, data is written back to the newly installed DCPMM using the data in the storage space managed by IMC 120, and if the write-back succeeds, the primary/backup relationship between IMC 120 and IMC 110 is restored.
It should be noted that in step 405, even if the data write-back fails, the primary/backup relationship between IMC 120 and IMC 110 is not unbound, so the address mapping between the storage space managed by IMC 120 and the storage space managed by IMC 110 is maintained; with the failed DCPMM removed from the DCPMMs managed by IMC 110, backups of the data stored in the remaining DCPMMs can still be kept in the storage space managed by IMC 120.
Second, the DCPMM fault handling method based on the memory management system architecture shown in FIG. 2.
As shown in FIG. 5, a DCPMM fault handling method provided in this embodiment of the present application includes:
Step 501: during device operation, each DCPMM in the device is checked for faults. This is the same as step 401; refer to the related description of step 401 for details, which are not repeated here.
Step 502: when a DCPMM managed by IMC 110 is detected to have failed, the data read/write operations of the device are switched to RAID 210; that is, RAID 210 is promoted to primary storage space and the storage space managed by IMC 110 becomes the backup storage space.
Step 503: data stored in the RAID is written back to the DCPMMs managed by IMC 110.
For example, the data stored in the RAID may be compared with the data in the storage space managed by IMC 110; where the two are inconsistent, the inconsistent data in the storage space managed by IMC 110 is erased and overwritten with the data stored in the RAID.
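A rough sketch of this compare-and-overwrite loop over the mapped sub-blocks (all helper names are hypothetical, and the sub-block size must match the mapping granularity):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SUBBLOCK_BYTES (64u * 1024u * 1024u)   /* must match the mapping granularity */

/* Sub-block pairing as in the earlier mapping sketch. */
struct subblock_map_entry {
    uint64_t primary_addr;   /* sub-block in the space managed by IMC 110 */
    uint64_t backup_addr;    /* mirrored sub-block in RAID 210            */
};

/* Assumed raw accessors for the two spaces. */
extern int read_raid(uint64_t addr, void *buf, size_t len);
extern int read_imc(uint64_t addr, void *buf, size_t len);
extern int write_imc(uint64_t addr, const void *buf, size_t len);

/* Walk the mapped sub-blocks; wherever the IMC-managed copy differs from the
 * RAID copy, overwrite the IMC copy with the data kept in the RAID. */
static int write_back_from_raid(const struct subblock_map_entry *map, uint64_t count)
{
    uint8_t *raid_buf = malloc(SUBBLOCK_BYTES);
    uint8_t *imc_buf  = malloc(SUBBLOCK_BYTES);
    int rc = (raid_buf && imc_buf) ? 0 : -1;

    for (uint64_t i = 0; rc == 0 && i < count; i++) {
        rc = read_raid(map[i].backup_addr, raid_buf, SUBBLOCK_BYTES);
        if (rc == 0)
            rc = read_imc(map[i].primary_addr, imc_buf, SUBBLOCK_BYTES);
        if (rc == 0 && memcmp(raid_buf, imc_buf, SUBBLOCK_BYTES) != 0)
            rc = write_imc(map[i].primary_addr, raid_buf, SUBBLOCK_BYTES);
    }
    free(raid_buf);
    free(imc_buf);
    return rc;
}
```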
Step 504: if the write-back succeeds, the primary/backup relationship between IMC 110 and RAID 210 is restored the next time the device starts, with IMC 110 as primary and RAID 210 as backup.
Step 505: if the write-back fails, an alarm message is sent to the user indicating that the DCPMM has failed, and the user can be instructed to replace the failed DCPMM.
The user powers off the device and replaces the failed DCPMM. After power-on, each component in the device is initialized; once initialization is complete, data is written back to the newly installed DCPMM using the data in RAID 210. If the write-back succeeds, the primary/backup relationship between IMC 110 and RAID 210 is restored.
It should be noted that in step 505, even if the data write-back fails, the primary/backup relationship between IMC 110 and RAID 210 is not unbound, so the address mapping between RAID 210 and the storage space managed by IMC 110 is maintained; with the failed DCPMM removed from the DCPMMs managed by IMC 110, backups of the data stored in the remaining DCPMMs can still be kept in RAID 210.
Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides a fault handling apparatus configured to execute the methods shown in FIG. 3, FIG. 4, and FIG. 5; for related features, refer to the method embodiments, which are not repeated here. As shown in FIG. 6, the fault handling apparatus 600 includes a detection unit 601 and a switching unit 602.
The detection unit 601 is configured to detect that the first storage managed by the first storage controller has failed.
The switching unit 602 is configured to, when the first storage managed by the first storage controller fails, transfer the data read/write operations of the device from the storage space formed by the first storage managed by the first storage controller to the storage space formed by the second storage managed by the second storage controller.
The fault handling apparatus 600 further includes a write-back unit 603, which may write back the faulty data in the first storage managed by the first storage controller using the data in the second storage managed by the second storage controller. If the write-back succeeds, the read/write operations of the device are switched back to the storage space formed by the first storage managed by the first storage controller after the device is next powered on; if the write-back fails, alarm information is sent indicating that the first storage has failed.
The fault handling apparatus 600 further includes a configuration unit 604, which may map the storage space of the first storage managed by the first storage controller to a logical space of the first storage controller. Before the detection unit detects that the first storage managed by the first storage controller has failed, the configuration unit 604 may also receive a user instruction indicating that a first mode or a second mode is to be configured, where the first mode indicates that the first storage and the second storage are DCPMMs, and the second mode indicates that the first storage is a DCPMM and the second storage is a disk; the configuration unit then configures the first mode or the second mode according to the instruction.
The fault handling apparatus 600 may be configured to execute the methods shown in FIG. 3, FIG. 4, and FIG. 5. The detection unit 601 may perform the detection of a failed DCPMM managed by IMC 110 in steps 401 and 402 of the method embodiment shown in FIG. 4, and likewise in steps 501 and 502 of the method embodiment shown in FIG. 5. The switching unit 602 may perform the switching of data read/write operations in step 402 of the method embodiment shown in FIG. 4 and in step 502 of the method embodiment shown in FIG. 5. The write-back unit 603 may perform steps 403 to 405 of the method embodiment shown in FIG. 4 and steps 503 to 505 of the method embodiment shown in FIG. 5. The configuration unit 604 may perform steps 303 to 310 of the method embodiment shown in FIG. 3.
It should be noted that the division into units in the embodiments of the present application is schematic and reflects only a logical division of functions; other divisions are possible in actual implementations. The functional units in the embodiments may be integrated into one processing unit, may each exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The device 700 shown in FIG. 7 comprises at least one processing unit 710, a first memory 720, a second memory 730, and optionally a communication interface 740.
The processing unit 710 includes a first storage controller and a second storage controller, and may also include one or more general-purpose processors, such as the CPU 100, or a combination of the CPU 100 and hardware chips. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The number and types of processors included in the processing unit 710 are not limited in the embodiments of the present application; any processor capable of calling computer program instructions in a memory (such as the first memory 720, the second memory 730, or another memory) may form the processing unit 710.
The embodiments of the present application do not limit the specific connection medium between the processing unit 710 and the first memory 720 or the second memory 730.
The device in FIG. 7 further includes the communication interface 740, through which the processing unit 710 can transmit data when communicating with other devices.
When the device takes the form shown in FIG. 7, the processing unit 710 may cause the device 700 to perform the methods shown in FIG. 3, FIG. 4, and FIG. 5 by calling computer program instructions stored in a memory, such as the first memory 720, the second memory 730, or another memory.
Specifically, the functions and implementation processes of the detection unit, the switching unit, the write-back unit, and the configuration unit in FIG. 6 may all be implemented by the processing unit 710 in FIG. 7 calling computer-executable instructions stored in the memory.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.

Claims (10)

1. A server, comprising a BIOS, a first storage controller, and a second storage controller, wherein the BIOS is connected to the first storage controller and the second storage controller, the server configures the first storage controller and the second storage controller via the BIOS, the first storage controller manages a first storage, and the second storage controller manages a second storage;
when the first storage managed by the first storage controller fails, data read/write operations of the server are transferred from a storage space formed by the first storage managed by the first storage controller to a storage space formed by the second storage managed by the second storage controller.
2. The server according to claim 1, wherein the storage space formed by the second storage managed by the second storage controller is a backup space of the storage space formed by the first storage managed by the first storage controller, and the storage space formed by the second storage managed by the second storage controller does not overlap the storage space formed by the first storage managed by the first storage controller.
3. The server according to claim 1 or 2, wherein the first storage and the second storage are data center persistent memory modules (DCPMMs), and the first storage controller and the second storage controller are memory controllers.
4. The server according to claim 1 or 2, wherein the first storage is a data center persistent memory module (DCPMM), the second storage is a plurality of disks, the first storage controller is a memory controller, and the second storage controller is a redundant array of independent disks (RAID) card.
5. The server according to claim 1 or 2, wherein data in the storage space of the first storage managed by the first storage controller is distributed among DCPMMs under multiple channels of the first storage controller.
6. The server according to claim 1 or 2, wherein the server configures, via the BIOS, the mode of the first storage controller and the second storage controller as a first mode or a second mode, the first mode indicating that the first storage and the second storage are DCPMMs, and the second mode indicating that the first storage is a DCPMM and the second storage is a disk.
7. The server according to claim 1 or 2, wherein the first storage and the second storage are DCPMMs, and after the first storage managed by the first storage controller fails, the BIOS writes back the faulty data in the first storage managed by the first storage controller using data in the second storage managed by the second storage controller.
8. The server according to claim 1 or 2, wherein the first storage is a DCPMM and the second storage is a plurality of disks; after the first storage managed by the first storage controller fails, the BIOS sends failure information to the OS of the server, the OS of the server writes back the faulty data in the first storage managed by the first storage controller using data in the second storage managed by the second storage controller, and the OS returns a message to the BIOS after the write-back succeeds.
9. The server according to claim 1 or 2, wherein the first storage and the second storage are DCPMMs, and the server performs, via the BIOS, address mapping between the storage space formed by the DCPMMs managed by the first storage controller and the storage space formed by the DCPMMs managed by the second storage controller.
10. The server according to claim 1 or 2, wherein the first storage is a DCPMM, the second storage is a plurality of disks, the second storage controller is a RAID card, the plurality of disks form a RAID, the RAID card configures a redundancy policy for the RAID, and the server performs, via the BIOS, address mapping between the non-volatile memory in the DCPMM managed by the first storage controller and the RAID managed by the second storage controller.