CN105868038B

CN105868038B - Memory error processing method and electronic equipment

Info

Publication number: CN105868038B
Application number: CN201610183532.2A
Authority: CN
Inventors: 刘峰; 杨立中
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2016-03-28
Filing date: 2016-03-28
Publication date: 2020-03-24
Anticipated expiration: 2036-03-28
Also published as: CN105868038A

Abstract

In the method, after a preset type of error is detected in a memory unit, the object using the memory unit is first determined; if the object is a user application, an error handling operation is executed for the memory unit, wherein the error handling operation is not a restart operation. The method can thus execute a non-restart error handling operation on the memory unit when the user of the faulty memory is a user application, avoiding the damage that direct downtime or a restart does to service continuity. In addition, the application also provides electronic equipment to ensure the application and implementation of the method in practice.

Description

Memory error processing method and electronic equipment

Technical Field

The present application relates to the field of memory management technologies, and in particular, to a memory error handling technology.

Background

A production server, as opposed to a development server, is a server put into service use and must continuously meet the access demands of a large amount of data. It will be appreciated that a high access rate increases the probability of errors in the memory of the production server.

At present, after an error occurs in the memory of a production server, the server is directly shut down or restarted. However, this crude handling may break the continuity of the service.

Disclosure of Invention

In view of this, the present application provides a memory error handling method to solve the technical problems that the existing handling approach destroys the continuity of the service and hinders subsequent troubleshooting of the cause of the error. In addition, the application also provides electronic equipment for ensuring the application and implementation of the method in practice.

In order to achieve the purpose, the technical scheme provided by the application is as follows:

in one aspect, the present application provides a memory error handling method applied to a central processing unit, including:

after obtaining the information that the memory unit has a preset type error, determining an object using the memory unit;

executing an error handling operation for the memory unit when the object is a user application; wherein the error handling operation is not a restart operation.

Optionally, in the memory error handling method, the executing the error handling operation for the memory unit includes:

when the central processing unit is one of the central processing units in a central processing unit cluster, preventing all the central processing units in the central processing unit cluster from reusing the error memory space on the memory unit; wherein all the central processing units in the central processing unit cluster access the memory unit concurrently.

Optionally, in the memory error handling method, when the central processing unit is one central processing unit in a central processing unit cluster, preventing all central processing units in the central processing unit cluster from reusing the error memory space on the memory unit includes:

when the central processing unit is the central processing unit in the cluster that is currently using the memory unit, broadcasting an error interrupt in the central processing unit cluster by means of an inter-core interrupt, so that all the central processing units in the cluster stop reusing the memory unit.

Optionally, in the memory error handling method, a first central processing unit in the central processing unit cluster that receives the error interrupt is a master central processing unit, and other central processing units are slave central processing units;

all central processing units in the central processing unit cluster stop reusing the memory unit, and the method comprises the following steps:

when the central processing unit is the master central processing unit, querying all tasks using the error memory space of the memory unit, and sending error signals to the target central processing units running those tasks in the central processing unit cluster, so that each target central processing unit triggers the user applications corresponding to its tasks to execute the error handling operation for the memory unit.

Optionally, in the memory error handling method, before querying all tasks using the error memory space of the memory unit, the method further includes:

determining the error of the memory cell to be a preset high-level error.

Optionally, the memory error handling method further includes:

when the central processing unit is one central processing unit in the central processing unit cluster and the object is a system kernel, saving system information related to the error, and broadcasting a restart signal in the central processing unit cluster to restart the system.

Optionally, the memory error handling method further includes:

recording the state information of the central processing unit in an extended register under the condition that the central processing unit is the central processing unit using the memory unit in the central processing unit cluster;

correspondingly, the specific step of determining that the object is a system kernel includes:

and determining the object as a system kernel according to the state information in the extension register.

In yet another aspect, the present application provides an electronic device comprising:

the memory unit is provided with an ECC chip;

the central processing unit is connected with the ECC chip through an extended error interrupt connection line, and is used for determining an object using the memory unit after acquiring information that a preset type of error has occurred in the memory unit, and for executing an error handling operation for the memory unit when the object is a user application; wherein the error handling operation is not a restart operation.

Optionally, the electronic device includes a central processing unit cluster, the central processing unit cluster accesses the memory unit in a concurrent manner, and the central processing unit is one of the central processing units in the cluster;

in the aspect of performing the error handling operation for the memory unit, the cpu is configured to:

and preventing all central processing units in the central processing unit cluster from reusing the error memory space on the memory unit.

Optionally, in the electronic device, the central processing unit is a central processing unit in the central processing unit cluster that currently uses the memory unit;

in terms of preventing all central processors in the central processor cluster from re-using the faulty memory space on the memory unit, the central processors are configured to:

and broadcasting error interrupt in the central processing unit cluster in an inter-core interrupt mode so as to stop all central processing units in the central processing unit cluster from reusing the memory unit.

Optionally, in the electronic device, the central processing unit is a central processing unit that first receives the error interrupt;

in an aspect in which the central processing unit causes all central processing units in the central processing unit cluster to stop reusing the memory unit, the central processing unit is configured to:

and inquiring all tasks using the error memory space of the memory unit, and sending error signals to a target central processing unit running the tasks in the central processing unit cluster so as to enable the target central processing unit to trigger user applications corresponding to the respective tasks to execute error processing operation aiming at the memory unit.

Optionally, the central processing unit of the electronic device is further configured to:

determining the error of the memory unit as a preset high-level error before querying all tasks using the error memory space of the memory unit.

Optionally, in the electronic device, the central processor is a central processor in a central processor cluster, and an object determined by the central processor is a system kernel;

the central processing unit is further configured to:

and saving system information related to the error, and broadcasting a restart signal in the central processing unit cluster to restart the system.

Optionally, in the electronic device, an extension register is disposed on the central processing unit, and the central processing unit is a central processing unit in the central processing unit cluster that uses the memory unit;

the central processing unit is further configured to: recording the state information of the central processing unit in the extended register;

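accordingly, in determining that the object is a system kernel, the central processing unit is configured to:

determine that the object is a system kernel according to the state information in the extension register.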

According to the foregoing technical solutions, the present application provides a memory error handling method in which, after a preset type of error is detected in a memory unit, the object using the memory unit is first determined, and if the object is a user application, an error handling operation for the memory unit is executed, the error handling operation not being a restart operation. The method can thus execute a non-restart error handling operation on the memory unit when the user of the faulty memory is a user application, avoiding the damage that direct downtime or a restart does to service continuity.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flowchart of a memory error processing method according to embodiment 1 of the present application;

Fig. 2 is a diagram illustrating an exemplary configuration of an electronic device provided herein;

Fig. 3 is a flowchart of a memory error processing method according to embodiment 2 of the present application;

Fig. 4 is a flowchart illustrating the different processing operations performed for different objects using a memory unit according to the present application;

Fig. 5 is a diagram illustrating a structure of an extension register provided in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

An electronic device such as a production server is provided with a memory unit and a Central Processing Unit (CPU). The memory unit, such as a memory bank, stores data that can be accessed by the central processing unit. When the central processing unit accesses the memory unit, if the memory unit is in error, access failure may be caused, and further, the electronic device may be down or restarted.

Therefore, the application provides a memory error processing method applied to a central processing unit. Referring to fig. 1, a flow of an embodiment 1 of a memory error handling method is shown. As shown in fig. 1, the present embodiment may specifically include steps S101 to S102.

Step S101: after the information that a preset type of error has occurred in the memory unit is obtained, determining the object using the memory unit.

After a preset type of error occurs in the memory unit, the memory unit can send error information to the central processing unit. Specifically, an ECC (Error Correction Code) chip may be disposed on the memory unit; the ECC chip monitors the memory unit and sends error information to the central processing unit if a multi-bit error occurs in the memory unit.

Of course, the preset type of error is not limited to multi-bit errors; it may also include other types of error from which the memory unit cannot recover on its own.

After obtaining the error information of the memory unit, the central processing unit needs to determine the object currently using the memory unit. It will be appreciated that the object using the memory unit may be the system kernel or a user application, and the error handling operation differs according to the object. If the object is the system kernel, the system can be restarted after the information related to the error is saved, the saved information being used for subsequent analysis of the error. If the object is a user application, step S102 may be performed.

Step S102: executing error processing operation aiming at the memory unit under the condition that the object is the user application; wherein the error handling operation is not a restart operation.

If the object using the memory unit is a user application, the central processing unit performs an error handling operation for the memory unit. To guarantee the continuity of operation of the electronic device, and in particular the continuity of service when the electronic device is a production server, the error handling operation is not a restart operation.

It should be noted that the error handling operation is performed for the memory unit with the error, such as preventing all the visitors from accessing the memory unit, preventing the memory unit from being reallocated, and so on.

According to the above technical scheme, the memory error handling method provided by this embodiment is applied to a central processing unit. After the central processing unit receives the error information of the memory unit, it first determines the object using the memory unit, and if that object is a user application, it executes a non-restart error handling operation on the memory unit, thereby avoiding the damage to service continuity caused by a direct shutdown or restart.
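The decision made in steps S101 and S102 can be sketched as follows. This is only an illustrative sketch: the names `MemoryUser` and `handle_memory_error`, and the returned action strings, are hypothetical and do not appear in the application.

```python
from enum import Enum, auto


class MemoryUser(Enum):
    """Hypothetical labels for the object using the memory unit."""
    SYSTEM_KERNEL = auto()
    USER_APPLICATION = auto()


def handle_memory_error(user: MemoryUser) -> str:
    """Return the action chosen for a preset-type error (illustrative only)."""
    if user is MemoryUser.USER_APPLICATION:
        # Step S102: a non-restart operation aimed at the faulty memory unit.
        return "isolate faulty memory and notify application"
    # Kernel case: save error information for later analysis, then restart.
    return "save error information and restart"
```

Only the kernel branch leads to a restart; the user application branch keeps the system running.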

In one particular example, the error handling operation on the memory unit may be to prevent a reuse operation of the faulty memory space on the memory unit.

Specifically, the memory pages (or referred to as page structures) corresponding to the faulty memory space may be marked as a damaged state, and further, the nodes corresponding to the faulty memory space may be isolated from the memory structure tree corresponding to the memory pages. If the node is isolated from the memory structure tree, the error memory space corresponding to the node can not be distributed for use any more.

The memory pages are page structures which are preset by a system and used for managing the memory space with errors, and the initial address of the memory space, whether the memory space has errors or not and which node of the memory structure tree the memory space is associated with can be recorded in the memory pages.
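A minimal sketch of this marking and isolation is given below, assuming the memory structure tree can be stood in for by a simple mapping from node identifiers to nodes. `PageStruct`, `MemoryTree` and `retire_page` are hypothetical names chosen for illustration; the application does not specify the actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class PageStruct:
    """Hypothetical page structure: start address, tree node id, error flag."""
    start_addr: int
    node_id: int
    damaged: bool = False


@dataclass
class MemoryTree:
    """Stand-in for the memory structure tree: node id -> node payload."""
    nodes: dict = field(default_factory=dict)

    def isolate(self, node_id: int) -> None:
        # Once the node is removed, its memory space can no longer be allocated.
        self.nodes.pop(node_id, None)

    def allocatable(self, node_id: int) -> bool:
        return node_id in self.nodes


def retire_page(page: PageStruct, tree: MemoryTree) -> None:
    page.damaged = True         # mark the page structure as damaged
    tree.isolate(page.node_id)  # isolate the corresponding node from the tree
```

For example, after `retire_page(PageStruct(0x1000, 1), tree)` the node with id 1 is no longer reported as allocatable, while other nodes are unaffected.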

In another specific example, the error handling operation on the memory unit may be sending an error signal to the user application to trigger the user application to invoke a preset error handling program, such as a main/standby switchover or a process restart. Of course, if the user application is not configured in advance with an error handling program, the operating system kills the process in the user application that accesses the error memory space.
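The two outcomes of delivering the error signal can be sketched as below. The sketch models the application and its handler as plain Python objects; `UserApp` and `deliver_error_signal` are hypothetical names, and a real system would use an operating system signal mechanism rather than a direct function call.

```python
from typing import Callable, Optional


class UserApp:
    """Hypothetical user application with an optional preset error handler."""

    def __init__(self, name: str, handler: Optional[Callable[[int], str]] = None):
        self.name = name
        self.handler = handler
        self.alive = True


def deliver_error_signal(app: UserApp, fault_addr: int) -> str:
    if app.handler is not None:
        # The application runs its own preset error handling program.
        return app.handler(fault_addr)
    # No preset handler: the operating system kills the offending process.
    app.alive = False
    return "killed"
```

An application that registered, say, a main/standby switchover handler survives the error, whereas one without a handler is terminated.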

In another specific example, the electronic device may be provided with a plurality of central processing units that operate in parallel (also referred to as backup): for example, if one central processing unit fails to operate normally, the other central processing units can act as its backup and execute its operations. The plurality of central processing units may be referred to as a central processing unit cluster, which operates in a parallel manner. In particular, in the application scenario of the present application, the central processing unit cluster may access the memory units in a parallel manner.

Referring to fig. 2, a diagram of one example of a structure of an electronic device is provided. As shown in fig. 2, the electronic device includes two central processing units CPU and two memory units, and each memory unit is connected to the two central processing units CPU through an extended interrupt connection line.

In this specific example, if the central processing unit is a central processing unit in a central processing unit cluster, the error handling operation performed by the central processing unit for the memory unit may be:

and preventing all central processing units in the central processing unit cluster from reusing the error memory space on the memory unit so as to prevent other central processing units from accessing the error memory space.

More specifically, if the ECC chip on the memory unit detects that the predetermined type of error occurs in the memory unit, an error interrupt (e.g., NMI error interrupt in fig. 2) may be sent to the cpu currently using the memory unit. After the central processing unit receives the error interrupt, the error interrupt can be broadcast to all central processing units in the central processing unit cluster in an inter-core interrupt mode.

Therefore, before the memory error processing is finished, the access of all the central processing units in the central processing unit cluster to the error memory space on the memory unit can be stopped, and the memory error is prevented from being diffused.

All the central processing units in the central processing unit cluster can receive the broadcast error interrupt, but the order in which they receive it is not fixed. The first central processing unit to receive the error interrupt is the master central processing unit, and the other central processing units are slave central processing units. It should be noted that the master central processing unit is not necessarily the central processing unit that is currently using the memory unit.

In particular, the operating system may pass the error interrupt on to all central processing units in the central processing unit cluster and call the respective interrupt handling function for each central processing unit. The first central processing unit to receive the error interrupt marks itself as the master central processing unit, and the other central processing units mark themselves as slave central processing units.

As shown in fig. 2, each central processing unit is further provided with an extension register in addition to its original registers. The master central processing unit and the slave central processing units each copy the data in their own extension register into a global memory that is visible to all the central processing units in the central processing unit cluster.
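The "first to receive becomes master" rule can be sketched with a lock-protected flag. The sketch uses Python threads as stand-ins for central processing units; `MasterElection` and `on_error_interrupt` are hypothetical names, and the application does not state how the marking is synchronized.

```python
import threading


class MasterElection:
    """Hypothetical election: the first caller becomes master, the rest slaves."""

    def __init__(self):
        self._lock = threading.Lock()
        self.master = None

    def on_error_interrupt(self, cpu_id: int) -> str:
        # Called from each CPU's interrupt handling function.
        with self._lock:
            if self.master is None:
                self.master = cpu_id
                return "master"
        return "slave"
```

Whatever the arrival order, exactly one caller is marked as master, which matches the note that the master is not necessarily the central processing unit that was using the memory unit.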

Referring to fig. 3, a flow chart of an error handling method embodiment 2 is shown. As shown in fig. 3, after the central processing unit (referred to as the current CPU) currently using the memory unit receives the error interrupt sent by the ECC chip, information related to the error of the memory unit may be written into the extension register.

The information related to the memory unit error may specifically include the address of the faulty memory space in the memory unit and the state information of the central processing unit. The state information may be, for example, kernel state or application state. Kernel state indicates that the central processing unit is currently being used by the system kernel, and application state indicates that it is currently being used by a user application.

Therefore, regardless of whether the central processing unit currently using the memory unit is the master or a slave central processing unit, once all the central processing units have finished writing to the global memory, the global memory necessarily contains the information related to the error of the memory unit.

The master central processing unit reads the data in the global memory, determines from the central processing unit state information in that data whether the object currently using the memory unit is the system kernel or a user application, and then executes different processing operations for the different objects.
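The gathering of register data into global memory and the lookup performed by the master can be sketched as follows. The record layout (the keys `fault_addr` and `state`) and the function names are assumptions made for illustration; in particular, the sketch assumes that only the affected central processing unit has a fault address recorded.

```python
def publish_registers(cpus: dict) -> list:
    """Each CPU copies its extension register data into globally visible memory."""
    return [dict(reg, cpu=cpu_id) for cpu_id, reg in cpus.items()]


def find_memory_user(global_memory: list):
    """Master CPU: locate the record written by the affected CPU, if any."""
    for record in global_memory:
        if record.get("fault_addr") is not None:
            # The state tells whether the kernel or a user application was running.
            return record["cpu"], record["state"]
    return None
```

Because every central processing unit publishes its record, the master can find the error information even when it is not itself the affected unit.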

Referring to FIG. 4, a flowchart of different processing operations performed on different usage objects of a memory cell is shown.

Step S401: the master central processing unit determines the object currently using the memory unit according to the data in the global memory. If the object is a system kernel, step S402 is executed. If the object is a user application, step S403 is executed.

Step S402: the master central processing unit saves the information related to the error to a disk file and sends a restart signal to all central processing units in the central processing unit cluster so that they restart.

Step S403: the master central processing unit marks the faulty memory space as damaged in the memory page (also called the page structure) corresponding to that space, and isolates the node corresponding to the faulty memory space from the memory structure tree corresponding to the memory page.

Step S404: the master central processing unit queries all tasks using the error memory space of the memory unit and sends error signals to the target central processing units running those tasks in the central processing unit cluster, so that each target central processing unit triggers the user applications corresponding to its tasks to execute the error handling operation for the memory unit.

After finding the memory page of the faulty memory space, the master central processing unit may further find the node corresponding to the faulty memory space in the memory structure tree, where the node records all tasks using the faulty memory space. Before isolating the node from the memory structure tree, the master central processing unit may record all of these tasks, for example by adding them to a task linked list, and then read them back from the task linked list.

It should be noted that a task is a task running on a central processing unit, and the tasks read may run on any one or more central processing units in the central processing unit cluster. The central processing units corresponding to the tasks are determined from the correspondence between each task and the central processing unit on which it runs; for convenience of description, these are called target central processing units. The master central processing unit sends error signals to all target central processing units.

After receiving the error signal, each target central processing unit triggers the user applications corresponding to its tasks to execute the error handling operation for the memory unit.
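The mapping from affected tasks to target central processing units can be sketched as a simple grouping. The function name `group_tasks_by_cpu` and the `task_to_cpu` mapping are hypothetical; the application only states that such a correspondence between tasks and central processing units exists.

```python
from collections import defaultdict


def group_tasks_by_cpu(faulty_tasks: list, task_to_cpu: dict) -> dict:
    """Group the tasks that use the faulty memory by the CPU they run on."""
    targets = defaultdict(list)
    for task in faulty_tasks:
        targets[task_to_cpu[task]].append(task)
    # The keys are the target CPUs that should receive an error signal.
    return dict(targets)
```

Central processing units that run none of the affected tasks do not appear in the result and therefore receive no error signal.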

Specifically, the target central processing unit queries a user application corresponding to each running task, and sends an error signal to the user application, and after receiving the error signal, the user application may execute a preset error processing operation for the memory unit, such as main/standby switching, and task restarting.

Of course, before step S404 is executed, it may also be determined whether the error of the memory unit is a preset high-level error. If so, step S404 is executed. Otherwise, the error event of the memory unit is added to a notification linked list, the system is notified to reload the faulty memory page, and step S404 is executed when the notification linked list is called. Thus, the user application is triggered to execute the error handling operation immediately only when the memory error is a high-level error; if it is not, the user application is triggered after a delay.
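The immediate and deferred paths can be sketched as below, with an ordinary Python list standing in for the notification linked list. `ErrorNotifier`, `report` and `run_notifications` are hypothetical names, and the callback passed in merely stands in for step S404.

```python
class ErrorNotifier:
    """Hypothetical dispatcher: immediate for high-level errors, deferred otherwise."""

    def __init__(self, signal_tasks):
        self._signal_tasks = signal_tasks  # stands in for step S404
        self._pending = []                 # stands in for the notification list

    def report(self, event, high_level: bool) -> None:
        if high_level:
            self._signal_tasks(event)      # trigger the user application now
        else:
            self._pending.append(event)    # defer until the list is called

    def run_notifications(self) -> None:
        while self._pending:
            self._signal_tasks(self._pending.pop(0))
```

Events that are not high-level are still delivered, only later, when `run_notifications` is invoked.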

Referring to fig. 5, an exemplary diagram of an extended register provided herein is shown. As shown in FIG. 5, the extension register can record four items of data, CTL, STATUS, ADDR and MISC, respectively.

CTL is the function enable bit. Writing 1 to CTL means that the central processing unit on which the extension register is provided has enabled the error handling function provided by the present application.

STATUS is the error handling status. If STATUS is 0, the central processing unit on which the extension register is provided is the central processing unit affected by the error.

ADDR is the physical address of the memory unit that the cpu is accessing when the memory unit has an error.

The MISC is the state of the central processing unit such as CPSR, IP and the like when the memory unit has errors. Meanwhile, it can also record the type of the memory unit address, such as the virtual address or the physical address, and how many bits of the address are valid.
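The four fields can be modelled in software as follows. This is a sketch only: the field widths and encodings are not given in the application, the non-zero default for `status` is an assumption, and `ExtensionRegister` and `record_error` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ExtensionRegister:
    ctl: int = 0     # 1 = error handling function enabled
    status: int = 1  # 0 = this CPU is affected by the error (default is assumed)
    addr: int = 0    # physical address being accessed when the error occurred
    misc: dict = field(default_factory=dict)  # CPU state such as CPSR and IP


def record_error(reg: ExtensionRegister, phys_addr: int, cpsr: int, ip: int) -> bool:
    """Record error details; do nothing if the function is not enabled."""
    if reg.ctl != 1:
        return False
    reg.status = 0
    reg.addr = phys_addr
    reg.misc = {"cpsr": cpsr, "ip": ip, "addr_type": "physical"}
    return True
```

A register with CTL left at 0 is unchanged by `record_error`, reflecting that the function has to be switched on before any error details are recorded.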

The electronic device provided in the present application is introduced below, and it should be noted that for the description of the electronic device, reference may be made to the memory error processing method provided above, and details are not described below.

See the exemplary diagram of the electronic device shown in fig. 2. As shown in fig. 2, the electronic device may specifically include: memory unit and central processing unit. Wherein:

the memory unit is provided with an ECC chip;

the central processing unit is connected with the ECC chip through the extended error interrupt connecting line and is used for determining an object using the memory unit after acquiring the information that the memory unit has a preset type error and executing error processing operation aiming at the memory unit under the condition that the object is a user application; wherein the error handling operation is not a restart operation.

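The electronic device may include a central processing unit cluster that accesses the memory units in a concurrent manner, the central processing unit being one of the central processing units in the cluster.

The specific steps of the central processing unit performing the error handling operation for the memory unit may include:

preventing all central processing units in the central processing unit cluster from reusing the error memory space on the memory unit.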

In a specific example, the central processing unit is a central processing unit currently using a memory unit in a central processing unit cluster;

the specific steps of the central processing unit executing the operation of preventing all the central processing units in the central processing unit cluster from reusing the error memory space in the memory unit may include:

and broadcasting error interrupt in the central processor cluster in an inter-core interrupt mode so as to stop the reuse operation of the memory unit by all the central processors in the central processor cluster.

In a specific example, the central processing unit is the first central processing unit to receive the error interrupt. The specific steps by which the central processing unit causes all the central processing units in the central processing unit cluster to stop reusing the memory unit may include:

querying all tasks using the error memory space of the memory unit, and sending error signals to the target central processing units running those tasks in the central processing unit cluster, so that each target central processing unit triggers the user applications corresponding to its tasks to execute the error handling operation for the memory unit.

In one specific example, the central processor is further configured to: before querying all tasks using the faulty memory space of the memory unit, determining the fault of the memory unit as a preset high-level fault.

In a specific example, the central processing unit is a central processing unit in a central processing unit cluster, and the object using the memory unit determined by the central processing unit is a system kernel, the central processing unit is further configured to:

system information associated with the error is saved and a reboot signal is broadcast in the cluster of central processors to perform a system reboot.

In one specific example, the central processing unit is provided with an extension register, and the central processing unit is a central processing unit using a memory unit in the central processing unit cluster. The central processing unit is further configured to: and recording the state information of the central processing unit in the extension register.

The specific steps of the central processing unit for determining the object as the system kernel comprise: and determining the object as a system kernel according to the state information in the extension register.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the same element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A memory error processing method is applied to a central processing unit, and comprises the following steps:

executing error processing operation aiming at the memory unit under the condition that the object is the user application; wherein the error handling operation is not a restart operation;

wherein the performing the error handling operation for the memory cell comprises:

2. The method of claim 1, wherein when the cpu is one of the cpu clusters, the operation of preventing all the cpus of the cpu cluster from reusing the faulty memory space in the memory unit comprises:

3. The memory error handling method of claim 2, wherein the first central processing unit in the central processing unit cluster that receives the error interrupt is a master central processing unit, and the other central processing units are slave central processing units;

4. The method of claim 3, further comprising, before querying all tasks that use the corrupted memory space of the memory cell:

determining the error of the memory cell to be a preset high-level error.

5. The memory error handling method of claim 1, further comprising:

6. The memory error handling method of claim 5, further comprising:

7. An electronic device, comprising a central processing unit cluster that accesses a memory unit in a concurrent manner, wherein the central processing unit is one of the central processing units in the cluster; the electronic device comprises:

the memory unit is provided with an ECC chip;

the central processing unit is connected with the ECC chip through an extended error interrupt connection line, and is used for determining an object using the memory unit after acquiring information that a preset type of error has occurred in the memory unit, and for executing an error handling operation for the memory unit when the object is a user application; wherein the error handling operation is not a restart operation;

wherein, in said performing the error handling operation for the memory unit, the central processing unit is to:

8. The electronic device of claim 7, wherein the central processor is a central processor of the cluster of central processors that is currently using the memory unit;

9. The electronic device of claim 8, wherein the central processor is the first central processor to receive the error interrupt;

10. The electronic device of claim 9, wherein the central processor is further configured to:

11. The electronic device of claim 7, wherein the central processor is one of a cluster of central processors and the object determined by the central processor is a system kernel;

the central processing unit is further configured to:

12. The electronic device according to claim 11, wherein an extension register is disposed on the central processing unit, and the central processing unit is a central processing unit of the central processing unit cluster that is using the memory unit;