CN111625387B - Memory error processing method, device and server - Google Patents



Publication number
CN111625387B
Authority
CN
China
Prior art keywords
error
memory
memory error
determining
type
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010464731.7A
Other languages
Chinese (zh)
Other versions
CN111625387A (en
Inventor
陈国民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010464731.7A
Publication of CN111625387A
Application granted
Publication of CN111625387B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Abstract

The invention provides a memory error processing method, a memory error processing device and a server, and relates to the technical field of computers, wherein the memory error processing method comprises the following steps: when error reporting information of a memory error is received, current system state information is obtained; judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure. According to the embodiment of the invention, the types of the using processes corresponding to the memory errors are subdivided, and the memory errors of different types of using processes are correspondingly processed, so that the influence of the memory errors on the cloud service can be reduced, and the stability of the cloud service is improved.

Description

Memory error processing method, device and server
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, and a server for processing memory errors.
Background
Stability is the foundation of cloud services and the core competitive strength of public cloud vendors. To improve system stability, cloud vendors need to deal with various unexpected downtime problems caused by server hardware faults.
To avoid server downtime caused by memory faults, the main existing approach that can detect and handle memory faults in time is to sense them in real time and handle them through the Intel MCA Recovery mechanism. However, the granularity of this approach is too coarse: it can only record logs, isolate and offline memory, kill processes, or bring the system down; and once a memory error log has been recorded, a complete migration of the virtual machine must be performed automatically or manually.
Therefore, the existing memory error processing method is coarse-grained, cloud services are still heavily affected by memory faults, and the stability of cloud services is poor.
Disclosure of Invention
Accordingly, the present invention aims to provide a memory error processing method, apparatus and server, which can timely perform finer granularity classification processing on memory errors, further reduce the influence of memory errors on cloud services, and improve the stability of cloud services.
In a first aspect, an embodiment of the present invention provides a memory error processing method, including: when error reporting information of a memory error is received, current system state information is obtained; judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure.
In a preferred embodiment of the present invention, the error reporting information includes a memory error address and a memory error type; the step of determining the processing measure for the memory error according to the judging result and the error reporting information comprises the following steps: if the memory error has been recovered by hardware, determining the memory block to which the memory error address belongs; determining the error count of the memory block according to the historical error log of the memory block; and if the error count is greater than a preset count threshold, taking the memory block offline and recording an error log.
In a preferred embodiment of the present invention, the method further comprises: if the error number is smaller than a preset number threshold, an error log is recorded, and the content of the error log comprises the memory error address and the memory error type.
In a preferred embodiment of the present invention, the error reporting information includes a memory error address and a memory error type; the step of determining the processing measure for the memory error according to the judging result and the error reporting information comprises the following steps: if the memory error has not been recovered by hardware, searching for the using process corresponding to the memory error address; determining a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
In a preferred embodiment of the present invention, the step of determining the processing measure for the memory error according to the process type of the using process includes: if the process type is a user space process, judging whether the using process is a key service process according to a preset process priority level; if the using process is not a key service process, recording an error log, sending a termination signal to the using process, and isolating and taking offline the memory pages corresponding to the using process.
In a preferred embodiment of the present invention, the method further comprises: if the using process is a key service process, recording an error log, sending a termination signal to the using process, and isolating and taking offline the memory pages corresponding to the using process; and restarting the using process.
In a preferred embodiment of the present invention, the step of determining the measure for handling the memory error according to the process type of the using process includes: if the process type is the kernel process of the operating system, returning system error information to the user; the operating system is restarted.
In a preferred embodiment of the present invention, the step of determining the measure for handling the memory error according to the process type of the using process includes: if the process type is a virtual machine process, recording an error log; the error reporting information is sent to a virtual machine system corresponding to the virtual machine process, so that after the virtual machine system receives the error reporting information, the current system state information is obtained, and whether the memory error is recovered by hardware or not is judged according to the current system state information, so that a judgment result is obtained; determining a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure until the memory error is processed.
In a second aspect, an embodiment of the present invention further provides a memory error processing apparatus, including: the system state information acquisition module is used for acquiring current system state information when receiving error reporting information of a memory error; the judging module is used for judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and the memory error processing module is used for determining a processing measure aiming at the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure.
In a preferred embodiment of the present invention, the error reporting information includes a memory error address and a memory error type; the memory error processing module is further configured to: if the memory error has been recovered by hardware, determine the memory block to which the memory error address belongs; determine the error count of the memory block according to the historical error log of the memory block; and if the error count is greater than a preset count threshold, take the memory block offline and record an error log.
In a preferred embodiment of the present invention, the error reporting information includes a memory error address and a memory error type; the memory error processing module is further configured to: if the memory error has not been recovered by hardware, search for the using process corresponding to the memory error address; determine a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
In a third aspect, an embodiment of the present invention further provides a server, where the server includes a processor and a memory, where the memory stores computer executable instructions executable by the processor, and the processor executes the computer executable instructions to implement the memory error handling method described above.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the memory error handling method described above.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a memory error processing method, a memory error processing device and a server, wherein when error reporting information of a memory error is received, current system state information is obtained; then judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure. In the method, the types of the memory errors are subdivided, and corresponding processing is carried out on the memory errors of different types, so that the influence of the memory errors on the cloud service can be reduced, and the stability of the cloud service is improved.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part will be obvious from the description, or may be learned by practice of the techniques of the disclosure.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a memory error handling method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another memory error handling method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a memory error handling apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a server according to an embodiment of the present invention.
Reference numerals: 31-a system state information acquisition module; 32-a judging module; 33-a memory error processing module; 41-a processor; 42-memory; 43-bus; 44-communication interface.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As the capacity of a single memory module has grown from a few megabytes to tens of gigabytes, and the number of memory modules installed in a server has increased, overall memory capacity has grown accordingly. Memory operating frequencies are also rising. As a result, memory faults account for the highest proportion of downtime caused by server components.
The cloud host is the most basic cloud service provided by public cloud manufacturers and runs on a physical server. Because a plurality of cloud hosts may run on the physical server, each cloud host belongs to a different client, downtime cost of the physical server in the cloud environment is also higher and higher.
In order to avoid downtime of the server caused by a memory failure, the following methods are mainly available at present:
(1) Before the server leaves the factory or goes online, the memory modules are tested and stress-tested to find faulty modules, which are replaced in advance. However, this check happens only at a single point in time; it is not real-time detection and cannot catch abnormalities that arise while the memory is running after the server goes online.
(2) Memory mirroring is used. Enabling memory mirroring for certain special services avoids problems caused by the failure of a single memory module or a single memory block. However, this approach multiplies the amount of memory required and therefore increases cost.
(3) A faulty memory block is isolated and taken offline in the OS or the BIOS through a memory isolation technique. However, this approach only handles a memory fault after the fact; it cannot sense and isolate memory errors in time, for example memory errors that occur during a restart or while the OS is running.
(4) Memory faults are sensed and handled in real time through the Intel MCA (Machine Check Architecture) Recovery mechanism. However, the granularity of the current mechanism is too coarse: it can only record logs, isolate and offline memory, kill processes, or bring the system down; and once a memory error log has been recorded, a complete migration of the virtual machine is performed automatically or manually.
In consideration of the problem that the granularity of the existing method capable of detecting and processing the memory errors in real time is large and the cloud service is still greatly affected by the memory faults, the invention provides a memory error processing method, a memory error processing device and a memory error processing server. For the sake of understanding the present embodiment, first, a memory error processing method disclosed in the present embodiment is described in detail.
Referring to fig. 1, a flow chart of a memory error processing method provided by an embodiment of the invention is shown, and as can be seen from fig. 1, the method includes the following steps:
Step S102: when error reporting information of a memory error is received, acquire current system state information.
Here, the error reporting information for the memory on the server can be obtained through the Intel MCA hardware. Intel introduced the MCA and MCE (Machine Check Exception) mechanisms so that server hardware can check itself and raise an interrupt or exception when a hardware error is found. In actual operation, when the system software receives the interrupt or exception, it responds with a corresponding action such as repair, alarm, or another policy. Intel's RAS (Reliability, Availability, and Serviceability) features allow the server to perform a degree of fault-tolerant handling before a crash or other error occurs, which has greatly improved Intel's competitiveness in high-reliability data-center servers.
In current computer systems, various hardware errors may reduce the stability of the software running above them, such as system bus errors, processor errors, ECC (Error Correcting Code) errors, parity errors, cache errors, TLB (Translation Lookaside Buffer) errors, IO errors, disk errors, and the like. There are also many types of memory-related hardware errors, such as DIMM (Dual In-line Memory Module) errors, bit-flip errors, ECC check errors, write errors, read errors, etc.
In this embodiment, once a memory-related hardware error occurs, the hardware platform notifies the upper software platform through an interrupt signal. When the software platform receives the error reporting information of the memory error, it acquires the current system state information, which includes information from the CPU's model-specific registers (MSRs), such as the address of the error, whether the instruction pointer (IP) of the current CPU is valid, the type of the current error, and the like.
Step S104: judge whether the memory error has been recovered by hardware according to the current system state information, to obtain a judging result.
In one possible implementation, the current system state information further includes the interrupt signal, so whether the error has been recovered by hardware can be determined from the error reporting information returned by the underlying hardware platform together with the interrupt signal. For example, when the error reporting information is delivered through the CMCI (Corrected Machine-Check Error Interrupt) interrupt number (0xf9), the memory error has been recovered by hardware; when it is delivered through the MCE interrupt number (0x12), the memory error has not been recovered by hardware.
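The interrupt-based judgment in this example can be sketched as follows. This is a minimal illustration only: the function name and the decision to reject other vectors are assumptions, while the two vector numbers are the ones quoted in the text.

```python
# Vector numbers quoted in the description above.
CMCI_VECTOR = 0xF9  # corrected machine-check error interrupt
MCE_VECTOR = 0x12   # machine-check exception


def recovered_by_hardware(vector: int) -> bool:
    """Return True if the error arrived via CMCI, False if it arrived via MCE.

    Rejecting any other vector is a choice made for this sketch only.
    """
    if vector == CMCI_VECTOR:
        return True
    if vector == MCE_VECTOR:
        return False
    raise ValueError(f"not a memory-error vector: {vector:#x}")
```

Keeping the judgment in one small function makes the later branches (block error counting versus process-type dispatch) depend on a single boolean.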
Step S106: determine a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure.
Different processing measures are adopted depending on whether or not the memory error has been recovered by hardware. When the current system state information shows that the memory error has not been recovered by hardware, a memory recovery measure is chosen according to the type of the service process affected by the memory error, so as to minimize the impact of the memory error on the system. For example, the corresponding process information is found from the error address, and different memory error recovery measures are adopted for different service processes such as a system process, an ordinary process, and a client virtual machine process, so that memory faults are subdivided further and each type of fault is handled separately.
Compared with the existing Intel MCA Recovery mechanism, the memory error processing method provided by the embodiment not only can sense and process the memory error in real time, but also distinguishes more detailed memory error types, and according to the type of the business process corresponding to the memory error, the processing measures are set pertinently, so that more timely and more detailed memory error processing and recovery can be performed, and the influence of memory faults is minimized.
The embodiment of the invention provides a memory error processing method, which acquires current system state information when receiving error reporting information of a memory error; then judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure. In the method, the types of the memory errors are subdivided, and corresponding processing is carried out on the memory errors of different types, so that the influence of the memory errors on the cloud service can be reduced, and the stability of the cloud service is improved.
On the basis of the memory error processing method shown in fig. 1, the present embodiment further provides another memory error processing method, which mainly describes a specific implementation process of determining a processing measure for a memory error according to a determination result and error reporting information, as shown in fig. 2, which is a flow chart of the memory error processing method, wherein the method includes the following steps:
Step S202: when error reporting information of a memory error is received, acquire current system state information; the error reporting information includes a memory error address and a memory error type.
In this embodiment, the received error reporting information of the memory error includes a memory error address and a memory error type, where the memory error address may be read from a register, and the memory error type may be classified as a recoverable error or an unrecoverable error according to the error context information.
A recoverable error indicates that the context state of the CPU is still normal when the memory error occurs, and that the memory error does not affect the subsequent operation of the CPU. An unrecoverable error indicates that the CPU is in an unreliable state when the memory error occurs, or that the memory error may affect the subsequent execution of the CPU; for example, a memory error in the memory where the error handling module itself resides is an unrecoverable error.
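The text does not state which context fields drive this classification. One plausible reading on Intel hardware, offered purely as an assumption, is to consult the PCC (processor context corrupt) flag of an IA32_MCi_STATUS register and the RIPV (restart IP valid) flag of IA32_MCG_STATUS. The function name and the two-flag rule below are hypothetical and are not taken from the patent.

```python
# Bit positions as documented for Intel machine-check status registers.
MCI_STATUS_PCC = 1 << 57   # IA32_MCi_STATUS: processor context corrupt
MCG_STATUS_RIPV = 1 << 0   # IA32_MCG_STATUS: restart instruction pointer valid


def classify_error(mci_status: int, mcg_status: int) -> str:
    """Classify a memory error from raw status register values.

    Assumed rule for this sketch: the error is unrecoverable if the CPU
    context is corrupt or execution cannot be restarted reliably.
    """
    if mci_status & MCI_STATUS_PCC:
        return "unrecoverable"
    if not (mcg_status & MCG_STATUS_RIPV):
        return "unrecoverable"
    return "recoverable"
```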
Step S204: judging whether the memory error is recovered by hardware or not according to the current system state information; if yes, go to step S206; if not, step S208 is performed.
If the memory error is recovered by hardware, determining a memory block to which the memory error address belongs; if the memory error is not recovered by the hardware, searching the use process corresponding to the memory error address.
Step S206: determine the memory block to which the memory error address belongs.
A memory block contains a plurality of memory addresses, so the memory block to which the memory error address belongs can be determined from that address.
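If memory blocks are fixed-size and contiguous, the mapping from address to block reduces to an integer division. The sketch below assumes that layout; the 128 MiB block size and the function name are hypothetical, since the text does not specify either.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # hypothetical block size, not given in the text


def block_of(address: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the index of the fixed-size memory block containing `address`."""
    if address < 0:
        raise ValueError("address must be non-negative")
    return address // block_size
```

Two different error addresses that fall inside the same block return the same index, which is what allows errors to be counted per block rather than per address.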
Step S208: search for the using process corresponding to the memory error address.
In an operating system, each process is mapped to its own physical memory pages, so the using process corresponding to the memory error address can be found from that address. The same memory address may be used by one or more processes at the same time.
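The lookup can be pictured as a reverse map from physical page frames to the processes that have them mapped. The sketch below is illustrative only: the `page_owners` table, the 4 KiB page size, and the function name are assumptions, and a real kernel would consult its own reverse-mapping structures instead.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def find_using_processes(address: int, page_owners: dict) -> list:
    """Return the sorted process IDs that use the page containing `address`.

    `page_owners` is a hypothetical table mapping a page frame number to
    the set of process IDs that have that frame mapped.
    """
    frame = address // PAGE_SIZE
    return sorted(page_owners.get(frame, set()))
```

Returning a list rather than a single ID reflects the statement that one address may be in use by several processes at once.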
Step S210: determine the error count of the memory block according to the historical error log of the memory block.
The historical error log of the memory block records each fault of the memory block, so the number of errors of the memory block within a given time period can be obtained from it. Assuming that the memory block includes two memory addresses A and B, an error at either memory address A or memory address B is recorded as an error of the memory block.
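Counting errors for a block within a time period can be sketched as below. The log layout (a list of `(timestamp, block_id)` pairs), the inclusive window, and the function name are all assumptions made for illustration.

```python
def count_block_errors(history, block_id, now, window):
    """Count logged errors for `block_id` with timestamps in [now - window, now].

    `history` is a hypothetical error log: an iterable of
    (timestamp, block_id) pairs, one per recorded error.
    """
    start = now - window
    return sum(1 for ts, blk in history if blk == block_id and start <= ts <= now)
```

Because each entry is keyed by block rather than by address, an error at address A and an error at address B in the same block both contribute to that block's count.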
Step S212: determining a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
If the process type is the user space process, judging whether the using process is a key service process according to a preset process priority level. Here, the process priority level may be preset according to an actual service type, which is not limited herein, and a process with a higher priority may be set as a key service process.
If the using process is not a key service process, an error log is recorded, a termination signal is sent to the using process, and the memory pages corresponding to the using process are isolated and taken offline. If the using process is a key service process, an error log is recorded, a termination signal is sent to the using process, and the memory pages corresponding to the using process are isolated and taken offline; the using process is then restarted. In one possible implementation, the termination signal is a KILL signal, and the process is killed by sending the KILL signal to the corresponding using process.
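The two user-space cases differ only in the final restart. A minimal sketch of that decision follows; it returns the planned actions instead of performing them, and the numeric priority scale (a larger number meaning higher priority), the threshold parameter, and the action labels are all hypothetical.

```python
def user_process_actions(pid: int, priority: int, key_threshold: int) -> list:
    """Plan the handling of an unrecovered error hitting a user space process.

    A process whose priority reaches `key_threshold` is treated as a key
    service process and is restarted after being killed (assumed scale).
    """
    actions = ["log_error", f"send_kill:{pid}", "offline_pages"]
    if priority >= key_threshold:
        actions.append(f"restart:{pid}")
    return actions
```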
If the process type is an operating system kernel process, system error information is returned to the user, and the operating system is then restarted. The kernel is the core and most basic part of the operating system. It is responsible for managing the system's processes, memory, device drivers, files and network, and it determines the performance and stability of the system. When a memory error occurs at a memory address used by an operating system kernel process, the system can no longer run reliably, so system error information is returned to the user and the operating system is restarted.
In addition, if the process type is a virtual machine process, an error log is recorded; the error reporting information is sent to a virtual machine system corresponding to the virtual machine process, after the virtual machine system receives the error reporting information, current system state information is acquired, and whether the memory error is recovered by hardware or not is judged according to the current system state information, so that a judgment result is obtained; determining a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure until the memory error is processed.
In one possible implementation, the error reporting information may be sent to the QEMU virtualization process through a SIGBUS (bus error) signal, and the QEMU process then delivers the error reporting information to the virtual machine system (guest OS) through an interrupt, so that the virtual machine system performs memory error processing recursively. That is, after receiving the error reporting information, the virtual machine system executes the memory error processing method inside the virtual machine: different processing measures are adopted depending on whether the memory error has been recovered by hardware, and when it has not, a memory recovery measure is chosen according to the type of the affected service process (user space process, operating system kernel process, or virtual machine process), until the memory error has been processed.
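The recursion across host and guest can be sketched with a toy model. Everything here is hypothetical: the `System` and `Process` types, the action strings, and the simplification that host and guest use the same address (a real system translates between host and guest physical addresses). Only the unrecovered branch is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Process:
    kind: str                          # "user", "kernel", or "vm"
    guest: Optional["System"] = None   # set only for virtual machine processes


@dataclass
class System:
    name: str
    owner_of: dict  # hypothetical map: error address -> Process using it


def handle_unrecovered(system, address, trail=None):
    """Handle an unrecovered error, recursing into the guest for VM processes."""
    trail = [] if trail is None else trail
    proc = system.owner_of[address]
    if proc.kind == "vm":
        trail.append(f"{system.name}: log and forward to guest")
        return handle_unrecovered(proc.guest, address, trail)
    if proc.kind == "kernel":
        trail.append(f"{system.name}: report system error and restart OS")
    else:
        trail.append(f"{system.name}: log, kill process, offline pages")
    return trail
```

In this model the recursion ends as soon as the error lands on a user space or kernel process at some nesting level, which mirrors "until the memory error has been processed".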
Step S214: judging whether the error times are larger than a preset times threshold value or not; if yes, go to step S216; if not, step S218 is performed.
If the error count of the memory block is greater than the preset count threshold, the memory block is failing frequently and its hardware may be damaged. In this case the memory block is taken offline, so that further memory errors do not affect the service stability of the system, and an error log is recorded.
If the error count of the memory block does not exceed the preset count threshold, an error log is recorded, wherein the content of the error log comprises the memory error address and the memory error type.
Step S216: take the memory block offline and record an error log.
Step S218: record an error log, wherein the content of the error log comprises the memory error address and the memory error type.
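Steps S214 to S218 amount to a single threshold comparison. A minimal sketch follows, with the function name and action labels chosen for illustration only; the threshold value itself is a preset the text leaves open.

```python
def recovered_error_actions(error_count: int, threshold: int) -> list:
    """Plan the handling of a hardware-recovered error for one memory block.

    Above the threshold the block is taken offline and the error is logged
    (S216); otherwise the error is only logged (S218).
    """
    if error_count > threshold:
        return ["offline_block", "log_error"]
    return ["log_error"]
```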
According to the memory error processing method provided by the embodiment, by utilizing the Intel MCA hardware architecture, by distinguishing the memory fault type from the service process type, different memory error processing measures are adopted for the key service process, the non-key service process, the operating system kernel process and the virtual machine process respectively, and the memory fault error processing and recovery are performed in more detail, so that the influence of the memory error on cloud service can be reduced, and the stability of cloud service is improved.
Corresponding to the above memory error processing method, this embodiment further provides a memory error processing apparatus. Fig. 3 is a schematic structural diagram of the apparatus. As shown in fig. 3, the apparatus includes a system state information obtaining module 31, a judging module 32, and a memory error processing module 33 that are connected in sequence, and the functions of the modules are as follows:
the system state information obtaining module 31 is configured to obtain current system state information when error reporting information of a memory error is received;
the judging module 32 is configured to judge, according to the current system state information, whether the memory error has been recovered by hardware, so as to obtain a judging result;
the memory error processing module 33 is configured to determine a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure.
The embodiment of the invention provides a memory error processing device, which acquires current system state information when receiving error reporting information of a memory error; then judging whether the memory error is recovered by hardware according to the current system state information to obtain a judging result; and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure. According to the device, the types of the memory errors are subdivided, and corresponding processing is carried out on the memory errors of different types, so that the influence of the memory errors on cloud services can be reduced, and the stability of the cloud services is improved.
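The data flow through modules 31, 32, and 33 can be sketched in Python as three functions called in sequence. This is a minimal sketch under stated assumptions: the class names `ErrorReport` and `SystemState`, the boolean field `recovered_by_hardware`, the `read_state` callback, and the returned description strings are all hypothetical placeholders, and the sketch does not show how the real register information and interrupt signal are read.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    address: int      # memory error address
    error_type: str   # memory error type


@dataclass
class SystemState:
    # Hypothetical flag standing in for the real register and interrupt data.
    recovered_by_hardware: bool


def obtain_system_state(read_state):
    """Module 31: obtain the current system state when an error is reported."""
    return read_state()


def judge_recovered(state):
    """Module 32: judge whether hardware has already recovered the error."""
    return state.recovered_by_hardware


def determine_measure(recovered, report):
    """Module 33: choose a processing measure from the result and the report."""
    if recovered:
        return f"count errors for the block containing {report.address:#x}"
    return f"find the process using {report.address:#x}"


def process_memory_error(report, read_state):
    """Run the three modules in the order in which they are connected."""
    state = obtain_system_state(read_state)
    recovered = judge_recovered(state)
    return determine_measure(recovered, report)
```

Passing `read_state` as a callback keeps the sketch independent of any particular hardware interface, so the same pipeline can be exercised with a stub state.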
In one possible implementation manner, the error reporting information includes a memory error address and a memory error type; the memory error processing module 33 is further configured to: if the memory error is recovered by hardware, determine the memory block to which the memory error address belongs; determine the error count of the memory block according to the historical error log of the memory block; and if the error count is greater than a preset count threshold, take the memory block offline and record an error log.
In another possible implementation manner, the memory error processing module 33 is further configured to: if the error count is not greater than the preset count threshold, record an error log, the content of which includes the memory error address and the memory error type.
In another possible implementation manner, the error reporting information includes a memory error address and a memory error type; the memory error processing module 33 is further configured to: if the memory error is not recovered by hardware, searching a using process corresponding to the memory error address; determining a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
In another possible implementation manner, the memory error processing module 33 is further configured to: if the process type is a user space process, judge whether the using process is a key service process according to a preset process priority level; if the using process is not a key service process, record an error log, send a termination signal to the using process, and isolate the memory page corresponding to the using process offline.
In another possible implementation manner, the memory error processing module 33 is further configured to: if the using process is a key service process, record an error log, send a termination signal to the using process, and isolate the memory page corresponding to the using process offline; then restart the using process.
In another possible implementation manner, the memory error processing module 33 is further configured to: if the process type is the kernel process of the operating system, returning system error information to the user; the operating system is restarted.
In another possible implementation manner, the memory error processing module 33 is further configured to: if the process type is a virtual machine process, recording an error log; the error reporting information is sent to a virtual machine system corresponding to the virtual machine process, so that after the virtual machine system receives the error reporting information, the current system state information is obtained, and whether the memory error is recovered by hardware or not is judged according to the current system state information, so that a judgment result is obtained; determining a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure until the memory error is processed.
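The handling of errors that hardware has not recovered can be summarized as a dispatch on the process type. The Python sketch below only restates the measures listed above; the function name `measures_for_process`, the process type strings, and the step names are hypothetical labels chosen for illustration, and each step would map to an operating system action in a real system.

```python
def measures_for_process(process_type, is_key_service=False):
    """Return the ordered processing measures for an unrecovered memory error."""
    if process_type == "user":
        steps = [
            "record_error_log",
            "send_termination_signal",
            "isolate_memory_offline",
        ]
        # A key service process is additionally restarted after isolation.
        if is_key_service:
            steps.append("restart_process")
        return steps
    if process_type == "kernel":
        return ["return_system_error_to_user", "restart_operating_system"]
    if process_type == "vm":
        # The virtual machine system repeats the same judgment on its own side.
        return ["record_error_log", "forward_error_report_to_vm_system"]
    raise ValueError(f"unknown process type: {process_type}")
```

For example, `measures_for_process("user", is_key_service=True)` returns the three user space steps followed by `"restart_process"`, while an unknown process type raises `ValueError` rather than silently doing nothing.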
The implementation principle and technical effects of the memory error processing device provided by the embodiment of the present invention are the same as those of the foregoing memory error processing method embodiment. For brevity, where the device embodiment does not mention a detail, reference may be made to the corresponding content in the foregoing method embodiment.
The embodiment of the present invention further provides a server, as shown in fig. 4, which is a schematic structural diagram of the server, where the server includes a processor 41 and a memory 42, the memory 42 stores machine executable instructions that can be executed by the processor 41, and the processor 41 executes the machine executable instructions to implement the memory error handling method described above.
In the embodiment shown in fig. 4, the server further comprises a bus 43 and a communication interface 44, wherein the processor 41, the communication interface 44 and the memory 42 are connected by means of the bus.
The memory 42 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 44 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, etc. The bus may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 4, but this does not mean that there is only one bus or one type of bus.
The processor 41 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 41 or by instructions in the form of software. The processor 41 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 42, and the processor 41 reads the information in the memory 42 and completes the steps of the memory error processing method of the foregoing embodiment in combination with its hardware.
The embodiment of the invention also provides a machine-readable storage medium, which stores machine-executable instructions that, when being called and executed by a processor, cause the processor to implement the memory error processing method, and the specific implementation can be found in the foregoing method embodiment, which is not repeated herein.
The computer program product of the memory error processing method, the memory error processing device, and the server provided by the embodiments of the present invention includes a computer readable storage medium storing program code. The instructions included in the program code may be used to execute the memory error processing method described in the foregoing method embodiments; for the specific implementation, reference may be made to the method embodiments, which is not repeated herein.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A memory error handling method, comprising:
when error reporting information of a memory error is received, current system state information is obtained; the current system state information includes: various special state register information of the CPU and interrupt signals, wherein the register information comprises: the address of the error, whether the IP instruction register of the current CPU is valid or not and the error type of the current error;
judging whether the memory error is recovered by hardware according to the interrupt signal of the current system state information to obtain a judging result;
and determining a processing measure for the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure.
2. The memory error handling method of claim 1, wherein the error reporting information comprises a memory error address and a memory error type;
the step of determining the processing measure for the memory error according to the judging result and the error reporting information comprises the following steps:
if the memory error is recovered by hardware, determining a memory block to which the memory error address belongs;
determining the error count of the memory block according to the historical error log of the memory block;
and if the error count is greater than a preset count threshold, taking the memory block offline and recording an error log.
3. The memory error handling method of claim 2, further comprising:
and if the error count is smaller than the preset count threshold, recording an error log, wherein the content of the error log comprises the memory error address and the memory error type.
4. The memory error handling method of claim 1, wherein the error reporting information comprises a memory error address and a memory error type;
the step of determining the processing measure for the memory error according to the judging result and the error reporting information comprises the following steps:
if the memory error is not recovered by hardware, searching a using process corresponding to the memory error address;
determining a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
5. The memory error handling method according to claim 4, wherein the step of determining a handling measure for the memory error according to a process type of the using process includes:
if the process type is a user space process, judging whether the using process is a key service process according to a preset process priority level;
if the using process is not a key service process, recording an error log, sending a termination signal to the using process, and isolating the memory page corresponding to the using process.
6. The memory error handling method of claim 5, further comprising:
if the using process is a key service process, recording an error log, sending a termination signal to the using process, and isolating the memory page corresponding to the using process offline;
restarting the using process.
7. The memory error handling method according to claim 4, wherein the step of determining a handling measure for the memory error according to a process type of the using process includes:
if the process type is the kernel process of the operating system, returning system error information to the user;
the operating system is restarted.
8. The memory error handling method according to claim 4, wherein the step of determining a handling measure for the memory error according to a process type of the using process includes:
if the process type is a virtual machine process, recording an error log;
the error reporting information is sent to a virtual machine system corresponding to the virtual machine process, so that after the virtual machine system receives the error reporting information, the current system state information is obtained, and whether the memory error is recovered by hardware or not is judged according to the current system state information, so that a judgment result is obtained; and determining a processing measure for the memory error according to the judging result and the error reporting information, so as to process the memory error through the processing measure until the memory error is processed.
9. A memory error handling apparatus, comprising:
the system state information acquisition module is used for acquiring current system state information when receiving error reporting information of a memory error; the current system state information includes: various special state register information of the CPU and interrupt signals, wherein the register information comprises: the address of the error, whether the IP instruction register of the current CPU is valid or not and the error type of the current error;
the judging module is used for judging whether the memory error is recovered by hardware according to the interrupt signal of the current system state information to obtain a judging result;
and the memory error processing module is used for determining a processing measure aiming at the memory error according to the judging result and the error reporting information so as to process the memory error through the processing measure.
10. The memory error handling apparatus of claim 9, wherein the error reporting information comprises a memory error address and a memory error type; the processing measure determining module is further configured to:
if the memory error is recovered by hardware, determining a memory block to which the memory error address belongs;
determining the error count of the memory block according to the historical error log of the memory block;
and if the error count is greater than a preset count threshold, taking the memory block offline and recording an error log.
11. The memory error handling apparatus of claim 10, wherein the error reporting information comprises a memory error address and a memory error type; the processing measure determining module is further configured to:
if the memory error is not recovered by hardware, searching a using process corresponding to the memory error address;
determining a processing measure for the memory error according to the process type of the using process; the process types comprise a user space process, an operating system kernel process and a virtual machine process.
12. A server comprising a processor and a memory, the memory storing computer executable instructions executable by the processor, the processor executing the computer executable instructions to implement the memory error handling method of any of claims 1 to 8.
13. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the memory error handling method of any one of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010464731.7A CN111625387B (en) 2020-05-27 2020-05-27 Memory error processing method, device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010464731.7A CN111625387B (en) 2020-05-27 2020-05-27 Memory error processing method, device and server

Publications (2)

Publication Number Publication Date
CN111625387A CN111625387A (en) 2020-09-04
CN111625387B true CN111625387B (en) 2024-03-29

Family

ID=72272399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010464731.7A Active CN111625387B (en) 2020-05-27 2020-05-27 Memory error processing method, device and server

Country Status (1)

Country Link
CN (1) CN111625387B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113064745B (en) * 2021-02-20 2022-09-20 山东英信计算机技术有限公司 Method, device and medium for reporting error information
US11385974B1 (en) 2021-03-01 2022-07-12 Google Llc Uncorrectable memory error recovery for virtual machine hosts
CN115904642A (en) * 2021-08-19 2023-04-04 北京字节跳动网络技术有限公司 Cloud server control method and device, storage medium and electronic equipment
CN114518972A (en) * 2022-02-14 2022-05-20 海光信息技术股份有限公司 Memory error processing method and device, memory controller and processor

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681909A (en) * 2012-04-28 2012-09-19 浪潮电子信息产业股份有限公司 Server early-warning method based on memory errors
TW201346532A (en) * 2011-12-22 2013-11-16 Intel Corp Apparatus and method for detecting and recovering from instruction fetch errors
CN104486100A (en) * 2014-11-28 2015-04-01 华为技术有限公司 Device and method for treating faults
CN105868038A (en) * 2016-03-28 2016-08-17 联想(北京)有限公司 Memory error processing method and electronic equipment
CN106844082A (en) * 2017-01-18 2017-06-13 联想(北京)有限公司 Processor predictive failure analysis method and device
CN107516547A (en) * 2016-06-16 2017-12-26 中兴通讯股份有限公司 The processing method and processing device of internal memory hard error
CN108763005A (en) * 2018-05-30 2018-11-06 郑州云海信息技术有限公司 A kind of memory ECC failures error-reporting method and system
CN109885521A (en) * 2019-02-28 2019-06-14 苏州浪潮智能科技有限公司 A kind of interruption processing method, system and electronic equipment and storage medium
CN110046061A (en) * 2019-03-01 2019-07-23 华为技术有限公司 EMS memory error treating method and apparatus
CN111143104A (en) * 2019-12-29 2020-05-12 苏州浪潮智能科技有限公司 Memory exception processing method and system, electronic device and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050015672A1 (en) * 2003-06-25 2005-01-20 Koichi Yamada Identifying affected program threads and enabling error containment and recovery
US10318368B2 (en) * 2016-05-31 2019-06-11 Intel Corporation Enabling error status and reporting in a machine check architecture
US10318455B2 (en) * 2017-07-19 2019-06-11 Dell Products, Lp System and method to correlate corrected machine check error storm events to specific machine check banks
US10761918B2 (en) * 2018-04-18 2020-09-01 International Business Machines Corporation Method to handle corrected memory errors on kernel text
US10922180B2 (en) * 2018-10-03 2021-02-16 International Business Machines Corporation Handling uncorrected memory errors inside a kernel text section through instruction block emulation

Also Published As

Publication number Publication date
CN111625387A (en) 2020-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant