CN116126581B - Memory fault processing method, device, system, equipment and storage medium - Google Patents

Memory fault processing method, device, system, equipment and storage medium

Publication number
CN116126581B
Authority
CN
China
Prior art keywords
fault
type
interrupt signal
memory
fault type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310411587.4A
Other languages
Chinese (zh)
Other versions
CN116126581A (en
Inventor
薛帅
黄明
崔毕轩
冯光辉
王宝林
宋卓
毛文安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202310411587.4A
Publication of CN116126581A
Application granted
Publication of CN116126581B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

One or more embodiments of the present disclosure provide a memory fault processing method, apparatus, system, device, and storage medium. The method comprises: receiving an interrupt signal sent by a processor based on an uncorrectable-error memory fault; in response to the interrupt signal, determining whether the fault belongs to a first fault type or a second fault type, wherein the first fault type indicates an error that must be processed and the second fault type indicates an error whose processing is optional; if the fault belongs to the first fault type, determining the memory page in which the uncorrectable error occurred according to fault information recorded by the processor or a memory controller, finding the process or virtual machine accessing that memory page, and terminating the process or virtual machine; and if the fault belongs to the second fault type, not terminating the process or virtual machine accessing the memory page indicated by the fault information. The embodiments improve the accuracy of error processing.

Description

Memory fault processing method, device, system, equipment and storage medium
Technical Field
One or more embodiments of the present disclosure relate to the field of fault handling technologies, and in particular, to a method, an apparatus, a system, a device, and a storage medium for processing a memory fault.
Background
In the related art, a memory fault may be triggered when a process or a virtual machine accesses memory. The triggered fault may be a corrected error (CE) or an uncorrected error (UE). A common approach to handling a UE memory fault is to stop the associated process or virtual machine immediately to prevent the error from propagating further, but this may terminate a process or virtual machine unnecessarily.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method, apparatus, system, device, and storage medium for processing a memory failure.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
according to a first aspect of one or more embodiments of the present description, there is provided a memory fault processing method, comprising:
receiving an interrupt signal sent by a processor based on an uncorrectable-error memory fault;
determining, in response to the interrupt signal, whether the fault belongs to a first fault type or a second fault type, wherein the first fault type indicates an error that must be processed, and the second fault type indicates an error whose processing is optional;
if the fault belongs to the first fault type, determining the memory page in which the uncorrectable error occurred according to fault information recorded by the processor or the memory controller, finding the process or virtual machine accessing that memory page, and terminating the process or virtual machine;
and if the fault belongs to the second fault type, not terminating the process or virtual machine accessing the memory page indicated by the fault information.
Optionally, the determining whether the current fault belongs to the first fault type or the second fault type includes:
determining whether the fault belongs to a first fault type or a second fault type according to the type of the interrupt signal; if the type of the interrupt signal is synchronous interrupt, determining that the fault belongs to a first fault type; and if the type of the interrupt signal is asynchronous interrupt, determining that the fault belongs to a second fault type.
Optionally, if the fault belongs to the first fault type, the method further includes:
after determining the memory page in which the uncorrectable error occurred, removing the read-write mapping of that memory page;
if the fault belongs to the second fault type, the method further includes:
determining the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, and removing the read-write mapping of that memory page.
Optionally, after receiving the interrupt signal, the firmware reports fault information recorded by the processor and the type of the interrupt signal to an operating system;
and determining whether the fault belongs to a first fault type or a second fault type by the operating system according to the type of the interrupt signal, and carrying out different processing based on different fault types.
Optionally, after receiving the interrupt signal, the firmware writes the fault information and the type of the interrupt signal into a target memory; after the writing is completed, calling an APEI interface to inform the operating system; the target memory indicates a shared memory which is agreed in advance by the firmware and the operating system;
and the operating system calls a callback function of the APEI interface after receiving the notification so as to acquire the fault information and the type of the interrupt signal from the target memory.
Optionally, the fault information and the type of the interrupt signal are written into the target memory by the firmware according to a preset data structure; the preset data structure indicates the firmware to write the fault information into the target memory according to a standard error interface format and write the type of the interrupt signal into the target memory according to a custom data format;
and the operating system parses the data written into the target memory according to the preset data structure to acquire the fault information and the type of the interrupt signal.
Optionally, the custom data format indicates that the type of the interrupt signal is reported using the raw data field in an APEI data structure;
wherein the raw data field includes a header field and a body field;
the header field describes verification information; the body field describes the type of the interrupt signal;
and the operating system parses and checks the verification information in the header field according to the custom data format, and acquires the type of the interrupt signal from the body field after the verification succeeds.
Optionally, after receiving the interrupt signal, the firmware determines whether the current fault belongs to a first fault type or a second fault type according to the type of the interrupt signal, and after determining the fault type of the current fault, the firmware reports the fault information recorded by the processor and the fault type of the current fault to an operating system;
and the operating system performs different processing based on the different fault types.
According to a second aspect of one or more embodiments of the present disclosure, there is provided a memory failure processing apparatus, including:
the interrupt signal receiving module is used for receiving an interrupt signal sent by the processor based on the memory fault of uncorrectable errors;
a fault type determining module, configured to determine, in response to the interrupt signal, whether the current fault belongs to a first fault type or a second fault type, where the first fault type indicates an error that must be processed, and the second fault type indicates an error that can be processed optionally;
a fault processing module, configured to, if the fault belongs to the first fault type, determine the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, find the process or virtual machine accessing that memory page, and terminate the process or virtual machine;
the fault processing module being further configured to, if the fault belongs to the second fault type, not terminate the process or virtual machine accessing the memory page indicated by the fault information.
According to a third aspect of one or more embodiments of the present specification, there is provided a resource scheduling system comprising:
a target resource node end scheduler corresponding to any target resource node in the resource node cluster, configured to execute the steps of the method according to any one of the first aspects;
A central scheduler corresponding to the cluster of resource nodes for performing the steps of the method according to any of the first aspects.
According to a fourth aspect of one or more embodiments of the present specification, there is provided an electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any of the first aspects by executing the executable instructions.
According to a fifth aspect of one or more embodiments of the present description, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method according to any of the first aspects.
The embodiments of this specification provide a memory fault processing method that receives an interrupt signal sent by a processor based on an uncorrectable-error memory fault and then, in response to the interrupt signal, determines whether the fault belongs to a first fault type or a second fault type. If the fault belongs to the first fault type, the memory page in which the uncorrectable error occurred is determined according to fault information recorded by the processor or the memory controller, the process or virtual machine accessing that memory page is found, and the process or virtual machine is terminated. If the fault belongs to the second fault type, the process or virtual machine accessing the memory page indicated by the fault information is not terminated. In these embodiments, when an uncorrectable-error memory fault occurs, it can further be distinguished whether the fault must be processed or may optionally be processed, and different processing is performed based on the fault type. This improves the accuracy of error processing and avoids or reduces cases in which a process or virtual machine is terminated by mistake.
Drawings
FIG. 1 is a flowchart of a memory fault processing method according to an exemplary embodiment.
FIG. 2 is a schematic diagram of the interaction of a processor, firmware and an operating system in the related art.
FIG. 3 is a schematic diagram of the interaction of a processor, firmware and an operating system provided by an exemplary embodiment.
FIG. 4 is another schematic diagram of the interaction of a processor, firmware, and an operating system provided by an exemplary embodiment.
FIG. 5 is a schematic diagram of a device according to an exemplary embodiment.
FIG. 6 is a block diagram of a memory fault processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Memory faults are a major cause of electronic device failure. One storage mode of a memory system is the error correcting code (ECC) storage mode. By generating an ECC code for the actual data and storing it in additional DRAM (dynamic random access memory), the memory controller can correct single-bit errors and detect two-bit errors in data received from the DRAM. Single-bit errors are referred to as CE (correctable errors), and two-bit errors are referred to as UE (uncorrectable errors).
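The correct-one, detect-two behavior described above can be illustrated with a small extended Hamming code. The following is a minimal sketch, not the code used by any particular memory controller: it protects only 4 data bits with 4 check bits, whereas production ECC memory typically protects much wider words. The function names `encode` and `decode` and the status strings are illustrative assumptions.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'CE' (corrected) or 'UE'."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position           # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1
    code = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treated as a single-bit error and repaired.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Even number of flips with a non-zero syndrome: detected, not correctable.
        return "UE", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword yields a "CE" result with the original data recovered, while flipping any two bits yields "UE" with no data, which mirrors the CE/UE distinction in the text.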
In general, UE (uncorrectable error) errors can be further divided into:
(1) AR (Action Required) errors: an instruction of a running process issues a memory read or write request, that is, the execution path consumes the UE error, and the processor triggers a synchronous exception. The operating system must handle the error immediately to prevent the UE error from propagating and causing the system to go down.
(2) AO (Action Optional) errors: the UE error is discovered by the processor in the background (for example, the memory controller discovers the UE error and triggers an asynchronous interrupt, or a CPU prefetch instruction accesses the UE error and triggers an asynchronous interrupt). The error is unrelated to the running context and does not propagate, so handling it is optional for the operating system.
The inventors found that, in the related art, a processor reporting a UE error does not distinguish whether the error is an AR error or an AO error. If the operating system (OS) handles the UE error as an AO error, the OS may crash after multiple exceptions are triggered. To avoid this, a conservative policy may be adopted in which AR errors and AO errors are both handled as AR errors and the related process or virtual machine is stopped immediately, but this may terminate some processes or virtual machines by mistake.
Based on the problems in the related art, the embodiments of the present disclosure provide a memory fault processing method that receives an interrupt signal sent by a processor based on an uncorrectable-error memory fault and then, in response to the interrupt signal, determines whether the fault belongs to a first fault type or a second fault type, wherein the first fault type indicates an error that must be processed and the second fault type indicates an error whose processing is optional. If the fault belongs to the first fault type, the memory page in which the uncorrectable error occurred is determined according to fault information recorded by the processor or the memory controller, the process or virtual machine accessing that memory page is found, and the process or virtual machine is terminated. If the fault belongs to the second fault type, the process or virtual machine accessing the memory page indicated by the fault information is not terminated. In these embodiments, when an uncorrectable-error memory fault occurs, it can further be distinguished whether the fault must be processed (AR error) or may optionally be processed (AO error), and different processing is performed based on the fault type. This improves the accuracy of error processing and avoids or reduces cases in which a process or virtual machine is terminated by mistake.
The memory fault processing methods provided by embodiments of the present application may be performed by electronic devices including, but not limited to, smart phones, tablet computers, personal digital assistants (PDAs), laptop computers, desktop computers, media content players, video game stations or systems, virtual reality systems, augmented reality systems, wearable devices (e.g., watches, glasses, gloves, headwear such as hats, helmets, virtual reality headsets, augmented reality headsets, head-mounted devices (HMDs) and headbands, pendants, armbands, leg rings, shoes, and vests), remote controls, or any other device with computing capabilities.
The electronic device includes a processor and a memory, where the memory stores executable instructions that can be executed on the processor, and the processor implements the memory failure processing method provided by the embodiment of the present application when executing the executable instructions.
The electronic device integrates a computer program product, and the electronic device realizes the memory fault processing method provided by the embodiment of the application when executing the computer program product.
In some embodiments, referring to fig. 1, fig. 1 is a flow chart illustrating a memory failure processing method. The method may be performed by an electronic device, the method comprising:
In S101, an interrupt signal sent by a processor based on an uncorrectable-error memory fault is received.
In S102, in response to the interrupt signal, it is determined whether the current fault belongs to a first fault type or a second fault type, wherein the first fault type indicates an error that must be processed, and the second fault type indicates an error whose processing is optional.
In S103, if the fault belongs to the first fault type, the memory page in which the uncorrectable error occurred is determined according to the fault information recorded by the processor or the memory controller, the process or virtual machine accessing that memory page is found, and the process or virtual machine is terminated.
In S104, if the fault belongs to the second fault type, the process or virtual machine accessing the memory page indicated by the fault information is not terminated.
In this embodiment, when an uncorrectable-error memory fault occurs, it can further be distinguished whether the error must be processed (AR error) or may optionally be processed (AO error), and different processing is performed based on the fault type. This improves the accuracy of error processing and avoids or reduces cases in which a process or virtual machine is terminated by mistake.
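The flow of S101 to S104 can be sketched as follows. This is a simplified model under stated assumptions, not the patented implementation: the `Host` and `FaultInfo` classes, the page-number field, and the function names are hypothetical, and the fault type is derived from the interrupt type (synchronous maps to the first fault type, asynchronous to the second), which is the optional mapping given in the disclosure.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class InterruptType(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


class FaultType(Enum):
    ACTION_REQUIRED = "AR"   # first fault type: must be processed
    ACTION_OPTIONAL = "AO"   # second fault type: processing is optional


@dataclass
class FaultInfo:
    page: int                # hypothetical: page number recorded by the hardware


@dataclass
class Host:
    page_users: dict[int, set[str]]  # page -> processes/VMs accessing it
    running: set[str]                # processes/VMs currently running


def classify(interrupt: InterruptType) -> FaultType:
    """S102: derive the fault type from the interrupt type."""
    if interrupt is InterruptType.SYNCHRONOUS:
        return FaultType.ACTION_REQUIRED
    return FaultType.ACTION_OPTIONAL


def handle_uncorrectable_error(host: Host, interrupt: InterruptType,
                               info: FaultInfo) -> list[str]:
    """S101 entry point; returns the names of the tasks that were terminated."""
    fault_type = classify(interrupt)
    if fault_type is FaultType.ACTION_REQUIRED:
        # S103: find the users of the faulty page and terminate them.
        victims = sorted(host.page_users.get(info.page, set()) & host.running)
        for name in victims:
            host.running.discard(name)
        return victims
    # S104: leave the tasks using the page running.
    return []
```

With a host where page 7 is used by a virtual machine named "vm-a", an asynchronous interrupt leaves "vm-a" running, whereas a synchronous interrupt for the same page terminates it.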
In some embodiments, after receiving an interrupt signal sent by a processor based on an uncorrectable-error memory fault, the electronic device may determine, according to the type of the interrupt signal, whether the fault belongs to the first fault type or the second fault type. If the interrupt signal is a synchronous interrupt, the fault is determined to belong to the first fault type; if the interrupt signal is an asynchronous interrupt, the fault is determined to belong to the second fault type. If the fault is determined to be of the first fault type, the electronic device determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor, finds the process or virtual machine accessing that memory page, and terminates the process or virtual machine, which prevents the uncorrectable error from propagating and improves the accuracy of error processing.
If the fault is determined to be of the second fault type, the electronic device does not need to terminate the process or virtual machine accessing the memory page indicated by the fault information, which improves the accuracy of error processing and avoids or reduces cases in which a process or virtual machine is terminated by mistake.
If the fault is determined to be of the first fault type, the electronic device determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, removes the read-write mapping of that memory page, finds the process or virtual machine accessing the memory page, and terminates the process or virtual machine.
If the fault is determined to be of the second fault type, the electronic device determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller and removes the read-write mapping of that memory page so that no process or virtual machine can access it; the process or virtual machine accessing the memory page does not need to be terminated. If a process or virtual machine later accesses the memory page indicated by the fault information, a page fault exception is triggered. Based on the resulting synchronous interrupt signal, the electronic device determines that an error that must be processed has occurred (that is, the first fault type) and handles it according to the flow for errors that must be processed, in which the electronic device may terminate the process or virtual machine.
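The deferred handling of the second fault type can be illustrated with a toy model: the page is isolated immediately, and a task is terminated only if it later touches the isolated page. This is a sketch under assumptions; the `Machine` class and its method names are hypothetical and do not correspond to any operating system interface.

```python
class Machine:
    """Toy model of page mappings; all names here are hypothetical."""

    def __init__(self, mappings):
        # page number -> set of tasks (processes or VMs) that map the page
        self.mappings = {page: set(tasks) for page, tasks in mappings.items()}
        self.running = set().union(*self.mappings.values())
        self.isolated = set()

    def isolate(self, page):
        """Remove the read-write mapping so no task can reach the bad page."""
        self.mappings.pop(page, None)
        self.isolated.add(page)

    def on_action_optional(self, page):
        """Second fault type: isolate the page, terminate nothing."""
        self.isolate(page)

    def on_action_required(self, page, task):
        """First fault type: isolate the page and terminate the accessing task."""
        self.isolate(page)
        self.running.discard(task)

    def access(self, task, page):
        """Return True on a normal access, False if the access ended the task."""
        if task not in self.running:
            raise RuntimeError(f"{task} is not running")
        if page in self.isolated:
            # Touching an isolated page faults synchronously, so the access
            # is handled as an error that must be processed.
            self.on_action_required(page, task)
            return False
        return True
```

For example, after `on_action_optional(1)` every task keeps running; only a task that later calls `access(task, 1)` is terminated, while other tasks that mapped page 1 but never touch it again are unaffected.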
In some embodiments, the electronic device includes firmware and an operating system. Firmware is a program written into EPROM (erasable programmable read-only memory) or EEPROM (electrically erasable programmable read-only memory). It acts as the device "driver" stored inside the device; through the firmware, an operating system can operate a specific machine according to a standard device driver. For example, optical drives and recorders all have internal firmware. Firmware is the lowest-level, most basic software in the system. An operating system is a set of interrelated system software programs that manages and controls computer operation, allocates hardware and software resources, and provides common services to organize user interaction.
In the electronic equipment, the discovery, reporting, processing and recovery of hardware (such as a processor) errors are supported, so that the hardware errors are recovered under the cooperation of the hardware, the firmware and an operating system, and the normal operation of the electronic equipment is ensured. The hardware, firmware, and operating system find, report, process, and recover errors from bottom to top. Wherein the hardware (e.g., processor) is responsible for finding and recording errors, the firmware is responsible for error collection and reporting, and the operating system is responsible for error handling and recovery.
Hardware reports hardware errors to upper-layer software in a variety of ways: some hardware communicates error messages over the PCI-E (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) bus, some requires reading and writing specific register sets to obtain error information, and some reports error status by generating specific interrupts or exceptions. Behind these varied approaches, hardware designers and software developers spend a significant amount of time defining and implementing interfaces, which adds unnecessary overhead.
The APEI (ACPI Platform Error Interface) specification unifies the interfaces between software and hardware and reduces development complexity for both software and hardware developers. The APEI interface is also more flexible and easier to extend. For example, the APEI specification reuses many structure definitions that already exist in UEFI (Unified Extensible Firmware Interface), which greatly improves the compatibility of APEI.
The hardware that finds and records the error may be a processor or a memory controller. The processor includes, but is not limited to, a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). The memory controller is an important component of a computer system that controls the memory and is responsible for exchanging data between the memory and the CPU.
In the related art, referring to FIG. 2, when the processor finds an uncorrectable-error memory fault, the processor may record fault information corresponding to the memory fault (201) and send an interrupt signal to the firmware (202); the firmware responds to the interrupt signal and reports the fault information recorded by the processor to the operating system (203); the operating system directly terminates the process or virtual machine accessing the memory page indicated by the fault information (203). In the related art, when a processor reports an uncorrectable-error memory fault (UE error), it does not distinguish whether the error is an AR error or an AO error, and the operating system usually adopts a conservative policy of directly terminating the process or virtual machine accessing the memory page in which the UE error occurred, which may terminate the process or virtual machine by mistake.
To solve the above problem, in one possible implementation of the present disclosure, referring to FIG. 3, when the processor finds an uncorrectable-error memory fault, the processor may record the fault information corresponding to the memory fault (301) and send an interrupt signal to the firmware (302). After receiving the interrupt signal sent by the processor based on the uncorrectable-error memory fault, the firmware reports the fault information recorded by the processor and the type of the interrupt signal to the operating system (303). In one example, the types of interrupt signal include, but are not limited to, synchronous exception, asynchronous interrupt (Interrupt), and asynchronous exception (SError).
The operating system determines whether the fault belongs to the first fault type or the second fault type according to the type of the interrupt signal; the first fault type indicates errors that must be processed (AR errors), and the second fault type indicates errors whose processing is optional (AO errors). If the interrupt signal is a synchronous interrupt, the fault is determined to belong to the first fault type: the operating system determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, removes the read-write mapping of that memory page, finds the process or virtual machine accessing the memory page, and terminates the process or virtual machine. If the interrupt signal is an asynchronous interrupt, the fault belongs to the second fault type: the operating system determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller and removes the read-write mapping of that memory page, but does not need to terminate the process or virtual machine accessing the memory page indicated by the fault information (304).
For example, referring to FIG. 4, the processor may send an interrupt signal (e.g., a synchronous interrupt signal or an asynchronous interrupt signal) to the firmware when it detects an uncorrectable-error memory fault. After receiving the interrupt signal, the firmware writes the fault information and the type of the interrupt signal into a target memory and, once the writing is complete, calls an APEI (ACPI Platform Error Interface) interface to notify the operating system. The target memory is a shared memory agreed upon in advance by the firmware and the operating system. Illustratively, the APEI specification provides a number of notification mechanisms for notifying the operating system. This embodiment does not limit the notification mechanism; for example, the operating system may be notified via GPIO (general-purpose input/output), SCI (System Control Interrupt), SDEI (Software Delegated Exception Interface), or the like.
After receiving the notification, the operating system calls a callback function of the APEI interface to obtain the fault information and the type of the interrupt signal from the target memory. It then determines, according to the type of the interrupt signal, whether the fault belongs to the first fault type or the second fault type, and handles the fault differently depending on the fault type. If the fault is an AR error, the operating system isolates the memory page in which the AR error occurred, that is, removes the read-write mapping of the memory page, and sends a SIGBUS signal to the related process or virtual machine to stop it. If the fault is an AO error, the operating system isolates the memory page in which the AO error occurred and does not need to stop the related process or virtual machine (not shown in FIG. 4).
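The write-then-notify handoff between the firmware and the operating system can be modeled with a small simulation. This is only a sketch under stated assumptions: `TargetMemory`, `Firmware`, and `OperatingSystem` are hypothetical classes, a plain Python callback stands in for the APEI notification, and the `page_users` table stands in for the kernel's knowledge of which processes map a page.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TargetMemory:
    """Stand-in for the shared memory agreed on by firmware and OS."""
    record: Optional[tuple] = None  # (fault_info, interrupt_type)


@dataclass
class OperatingSystem:
    target: TargetMemory
    page_users: dict = field(default_factory=dict)     # page -> [pid, ...]
    isolated_pages: list = field(default_factory=list)
    signals_sent: list = field(default_factory=list)   # (pid, signal name)

    def on_error_notification(self) -> None:
        """Callback run when the firmware notifies the OS."""
        fault_info, interrupt_type = self.target.record
        page = fault_info["page"]
        # Both AR and AO errors isolate the faulty page.
        self.isolated_pages.append(page)
        # Only AR errors (synchronous interrupt) stop the accessors.
        if interrupt_type == "synchronous":
            for pid in self.page_users.get(page, []):
                self.signals_sent.append((pid, "SIGBUS"))


@dataclass
class Firmware:
    target: TargetMemory
    notify_os: Callable[[], None]

    def handle_interrupt(self, fault_info: dict, interrupt_type: str) -> None:
        # Write the record first, then notify, so the OS never reads stale data.
        self.target.record = (fault_info, interrupt_type)
        self.notify_os()


# Usage: one AR error on page 0x1000, then one AO error on page 0x2000.
target = TargetMemory()
os_model = OperatingSystem(target, page_users={0x1000: [42], 0x2000: [43]})
firmware = Firmware(target, os_model.on_error_notification)
firmware.handle_interrupt({"page": 0x1000}, "synchronous")
firmware.handle_interrupt({"page": 0x2000}, "asynchronous")
```

After these two calls, both pages are recorded as isolated, but only process 42 has been sent SIGBUS.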
For example, when the firmware writes the relevant fault data into the target memory, the firmware may write the fault information and the type of the interrupt signal into the target memory according to a preset data structure. The operating system may also parse the data written into the target memory according to the preset data structure to obtain the fault information and the type of the interrupt signal.
In one possible implementation, the preset data structure instructs the firmware to write the fault information into the target memory according to the standard error record format (Common Platform Error Record, CPER) defined in the UEFI (Unified Extensible Firmware Interface) specification, and to write the type of the interrupt signal into the target memory according to a custom data format. Illustratively, the custom data format specifies that the type of the interrupt signal is reported using the raw data field in the APEI data structure, allowing the operating system to distinguish between AR errors and AO errors.
The original data (raw data) field includes a header field and a body field. The header field describes metadata, for example verification information; the body field describes the type of the interrupt signal. The operating system parses the verification information in the header field according to the custom data format and verifies it, and obtains the type of the interrupt signal from the body field after the verification succeeds.
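One way to picture the header-plus-body layout is the following sketch. The disclosure does not specify the byte layout or the kind of verification information, so everything concrete here is an assumption made for illustration: the magic value `MAGIC`, the use of a CRC32 checksum as the verification information, the one-byte interrupt codes, and the function names `pack_raw_data` and `parse_raw_data`.

```python
import struct
import zlib

MAGIC = 0x4D454D46               # hypothetical marker, not from any specification
HEADER = struct.Struct("<II")    # assumed header: magic, CRC32 of the body
BODY = struct.Struct("<B")       # assumed body: one-byte interrupt type code

SYNCHRONOUS, ASYNCHRONOUS = 1, 2  # hypothetical codes


def pack_raw_data(interrupt_code: int) -> bytes:
    """Firmware side: build the raw data field (header followed by body)."""
    body = BODY.pack(interrupt_code)
    return HEADER.pack(MAGIC, zlib.crc32(body)) + body


def parse_raw_data(raw: bytes) -> int:
    """OS side: verify the header, then read the interrupt type from the body."""
    if len(raw) < HEADER.size + BODY.size:
        raise ValueError("raw data field is too short")
    magic, checksum = HEADER.unpack_from(raw, 0)
    body = raw[HEADER.size:HEADER.size + BODY.size]
    # The body is trusted only after the verification information checks out.
    if magic != MAGIC or zlib.crc32(body) != checksum:
        raise ValueError("verification failed")
    (interrupt_code,) = BODY.unpack(body)
    return interrupt_code
```

Verifying before reading the body lets the operating system ignore raw data written by firmware that does not follow the custom format.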
In some application scenarios, the original data field may be used to report not only error information related to memory faults but also other types of error information. In that case, the metadata described by the header field may further include a fault type, for example, but not limited to, memory fault, processor fault, or bus fault. The operating system parses the verification information in the header field according to the custom data format and verifies it, and obtains the type of the interrupt signal from the body field only when the verification succeeds and the fault type is memory fault.
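The additional gating on the fault type carried in the header might look like the sketch below. The field order, the marker `MAGIC`, the CRC32 check, the category codes, and the function names are all hypothetical; the term "category" is used in the code only to keep this header value distinct from the first and second fault types.

```python
import struct
import zlib

MAGIC = 0x52415744                   # hypothetical marker
HEADER = struct.Struct("<IBI")       # assumed: magic, fault category, CRC32 of body
CATEGORY_MEMORY, CATEGORY_PROCESSOR, CATEGORY_BUS = 1, 2, 3  # hypothetical codes


def pack(category: int, body: bytes) -> bytes:
    """Firmware side: prefix the body with a header naming the fault category."""
    return HEADER.pack(MAGIC, category, zlib.crc32(body)) + body


def interrupt_type_if_memory_fault(raw: bytes):
    """OS side: return the interrupt code only for verified memory faults."""
    magic, category, checksum = HEADER.unpack_from(raw, 0)
    body = raw[HEADER.size:]
    if not body or magic != MAGIC or zlib.crc32(body) != checksum:
        raise ValueError("verification failed")
    # Records for other components are left to other handlers.
    if category != CATEGORY_MEMORY:
        return None
    return body[0]
```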
In another possible implementation of the present disclosure, when the processor detects an uncorrectable-error memory fault, the processor may record the fault information corresponding to the memory fault and send an interrupt signal to the firmware. After receiving the interrupt signal sent by the processor based on the uncorrectable-error memory fault, the firmware determines whether the fault belongs to the first fault type or the second fault type according to the type of the interrupt signal and, having determined the fault type, reports the fault information recorded by the processor together with the fault type to the operating system. The operating system then handles the fault differently depending on the fault type. If the type of the interrupt signal is a synchronous interrupt, the fault is determined to belong to the first fault type: the operating system determines the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor, removes the read-write mapping of that memory page, finds the process or virtual machine accessing the memory page, and ends that process or virtual machine. If the type of the interrupt signal is an asynchronous interrupt, the fault is determined to belong to the second fault type: the operating system may determine the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor, and isolate that memory page.
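In this alternative, the classification moves from the operating system into the firmware, so the report carries a fault type instead of an interrupt type. A minimal sketch, assuming a dictionary as the report format and using the hypothetical function names `firmware_build_report` and `os_handle_report`:

```python
def firmware_build_report(fault_info: dict, interrupt_type: str) -> dict:
    """Firmware side: classify the fault, then report info plus fault type."""
    if interrupt_type == "synchronous":
        fault_type = "first"      # must be handled (AR)
    elif interrupt_type == "asynchronous":
        fault_type = "second"     # may optionally be handled (AO)
    else:
        raise ValueError(f"no classification rule for {interrupt_type!r}")
    return {"fault_info": fault_info, "fault_type": fault_type}


def os_handle_report(report: dict) -> list:
    """OS side: act on the reported fault type without seeing the interrupt."""
    page = report["fault_info"]["page"]
    # Both fault types isolate the page with the uncorrectable error.
    actions = [("isolate_page", page)]
    # Only the first fault type ends the accessing process or virtual machine.
    if report["fault_type"] == "first":
        actions.append(("end_accessors", page))
    return actions
```

Compared with the first implementation, the operating system here needs no knowledge of the platform's interrupt types; it only consumes the two fault types.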
The various technical features of the above embodiments may be arbitrarily combined as long as there is no conflict or contradiction between the features, but are not described in detail, and therefore, the arbitrary combination of the various technical features of the above embodiments is also within the scope of the disclosure of the present specification.
Fig. 5 is a schematic block diagram of a device according to an exemplary embodiment. Referring to fig. 5, at the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, and may also include hardware required by other services. One or more embodiments of the present description may be implemented in software, for example by the processor 502 reading a corresponding computer program from the non-volatile storage 510 into the memory 508 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, and may also be hardware or a logic device.
Referring to fig. 6, the memory failure processing apparatus may be applied to the device shown in fig. 5 to implement the technical solution of the present specification. The memory failure processing apparatus may include:
An interrupt signal receiving module 601, configured to receive an interrupt signal sent by a processor based on an uncorrectable error memory failure;
a fault type determining module 602, configured to determine, in response to the interrupt signal, whether the current fault belongs to a first fault type or a second fault type, where the first fault type indicates an error that must be handled, and the second fault type indicates an error that can be handled optionally;
the fault handling module 603 is configured to, if the fault belongs to the first fault type, determine the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, find a process or a virtual machine accessing the memory page, and end the process or the virtual machine;
the fault handling module 603 is further configured to, if the fault belongs to the second fault type, not end a process or a virtual machine accessing the memory page indicated by the fault information.
In some embodiments, the fault type determining module 602 is specifically configured to determine, according to the type of the interrupt signal, whether the current fault belongs to the first fault type or the second fault type; if the type of the interrupt signal is synchronous interrupt, determining that the fault belongs to a first fault type; and if the type of the interrupt signal is asynchronous interrupt, determining that the fault belongs to a second fault type.
In some embodiments, the fault handling module 603 is further configured to, if the fault belongs to the first fault type, remove the read-write mapping of the memory page in which the uncorrectable error occurred after determining that memory page according to the fault information recorded by the processor or the memory controller; and, if the fault belongs to the second fault type, determine the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, and remove the read-write mapping of that memory page.
In some embodiments, after receiving the interrupt signal, the firmware reports fault information recorded by the processor and the type of the interrupt signal to an operating system; and determining whether the fault belongs to a first fault type or a second fault type by the operating system according to the type of the interrupt signal, and carrying out different processing based on different fault types.
In some embodiments, after receiving the interrupt signal, the firmware writes the fault information and the type of the interrupt signal into a target memory; after the writing is completed, calling an APEI interface to inform the operating system; the target memory indicates a shared memory which is agreed in advance by the firmware and the operating system; and the operating system calls a callback function of the APEI interface after receiving the notification so as to acquire the fault information and the type of the interrupt signal from the target memory.
In some embodiments, the fault information and the type of the interrupt signal are written into the target memory by the firmware according to a preset data structure; the preset data structure indicates the firmware to write the fault information into the target memory according to a standard error interface format and write the type of the interrupt signal into the target memory according to a custom data format; and the operating system analyzes the data written into the target memory according to the preset data structure to acquire the fault information and the type of the interrupt signal.
In some embodiments, the custom data format specifies that the type of the interrupt signal is reported using an original data field in an APEI data structure; wherein the original data field includes a header field and a body field; the header field is used for describing verification information; the body field is used for describing the type of the interrupt signal; and the operating system parses the verification information in the header field according to the custom data format and verifies it, and obtains the type of the interrupt signal from the body field after the verification succeeds.
In some embodiments, after receiving the interrupt signal, the firmware determines whether the current fault belongs to a first fault type or a second fault type according to the type of the interrupt signal, and after determining the fault type of the current fault, reports the fault information recorded by the processor and the fault type of the current fault to an operating system; different treatments are performed by the operating system based on different fault types.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
In some embodiments, embodiments of the present disclosure further provide a resource scheduling system, including: a target resource node end scheduler corresponding to any target resource node in the resource node cluster, configured to execute the steps of any one of the methods described above; a central scheduler corresponding to the resource node cluster, configured to perform the steps of any one of the methods described above.
In some embodiments, embodiments of the present disclosure further provide an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor implements the method of any of the above by executing the executable instructions.
In some embodiments, the present description embodiments also provide a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method as described in any of the above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include, among computer-readable media, forms such as volatile memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims (11)

1. A memory failure handling method, comprising:
receiving an interrupt signal sent by a processor based on an uncorrectable error memory failure;
determining whether the fault belongs to a first fault type or a second fault type in response to the interrupt signal, wherein the first fault type indicates an error which needs to be processed, and the second fault type indicates an error which can be processed optionally;
if the fault belongs to the first fault type, determining the memory page in which the uncorrectable error occurred according to fault information recorded by the processor or the memory controller, finding a process or a virtual machine accessing the memory page, and ending the process or the virtual machine;
if the fault belongs to the second fault type, not ending the process or the virtual machine accessing the memory page indicated by the fault information;
After receiving the interrupt signal, the firmware reports the type of the interrupt signal to an operating system according to a custom data format, wherein the custom data format indicates that the type of the interrupt signal is reported by using an original data field in an APEI data structure; the original data field comprises a header field and a body field; the header field is used for describing verification information; the body field is used for describing the type of the interrupt signal;
and analyzing the verification information in the header field by the operating system according to the custom data format for verification, acquiring the type of the interrupt signal from the body field after the verification is correct, determining whether the fault belongs to a first fault type or a second fault type according to the type of the interrupt signal, and carrying out different processing based on different fault types.
2. The method of claim 1, the determining whether the current fault belongs to a first fault type or a second fault type, comprising:
determining whether the fault belongs to a first fault type or a second fault type according to the type of the interrupt signal; if the type of the interrupt signal is synchronous interrupt, determining that the fault belongs to a first fault type; and if the type of the interrupt signal is asynchronous interrupt, determining that the fault belongs to a second fault type.
3. The method of claim 1, wherein, if the fault belongs to the first fault type, the method further comprises:
after determining the memory page in which the uncorrectable error occurred, releasing the read-write mapping relation of that memory page;
if the fault belongs to the second fault type, the method further comprises:
determining the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, and releasing the read-write mapping relation of that memory page.
4. A method according to any one of claims 1 to 3, wherein after receiving the interrupt signal, the firmware reports fault information recorded by the processor and the type of the interrupt signal to an operating system;
and determining whether the fault belongs to a first fault type or a second fault type by the operating system according to the type of the interrupt signal, and carrying out different processing based on different fault types.
5. The method of claim 4, wherein the firmware writes the fault information and the type of the interrupt signal into a target memory after receiving the interrupt signal; after the writing is completed, calling an APEI interface to inform the operating system; the target memory indicates a shared memory which is agreed in advance by the firmware and the operating system;
And the operating system calls a callback function of the APEI interface after receiving the notification so as to acquire the fault information and the type of the interrupt signal from the target memory.
6. The method of claim 5, wherein the fault information and the type of the interrupt signal are written into the target memory by the firmware according to a preset data structure; the preset data structure indicates the firmware to write the fault information into the target memory according to a standard error interface format and write the type of the interrupt signal into the target memory according to a custom data format;
and the operating system analyzes the data written into the target memory according to the preset data structure to acquire the fault information and the type of the interrupt signal.
7. A method according to any one of claims 1 to 3, wherein after receiving the interrupt signal, the firmware determines whether the current fault belongs to a first fault type or a second fault type according to the type of the interrupt signal, and after determining the fault type of the current fault, the firmware reports the fault information recorded by the processor and the fault type of the current fault to an operating system;
Different treatments are performed by the operating system based on different fault types.
8. A memory failure handling apparatus comprising:
the interrupt signal receiving module is used for receiving an interrupt signal sent by the processor based on the memory fault of uncorrectable errors;
a fault type determining module, configured to determine, in response to the interrupt signal, whether the current fault belongs to a first fault type or a second fault type, where the first fault type indicates an error that must be processed, and the second fault type indicates an error that can be processed optionally;
the fault processing module is configured to, if the fault belongs to the first fault type, determine the memory page in which the uncorrectable error occurred according to the fault information recorded by the processor or the memory controller, find a process or a virtual machine accessing the memory page, and end the process or the virtual machine;
the fault processing module is further configured to, if the fault is of the second fault type, not end a process or a virtual machine accessing the memory page indicated by the fault information;
after receiving the interrupt signal, the firmware reports the type of the interrupt signal to an operating system according to a custom data format, wherein the custom data format indicates that the type of the interrupt signal is reported by using an original data field in an APEI data structure; the original data field comprises a header field and a body field; the header field is used for describing verification information; the body field is used for describing the type of the interrupt signal;
And analyzing the verification information in the header field by the operating system according to the custom data format for verification, acquiring the type of the interrupt signal from the body field after the verification is correct, determining whether the fault belongs to a first fault type or a second fault type according to the type of the interrupt signal, and carrying out different processing based on different fault types.
9. A resource scheduling system, the resource scheduling system comprising:
a target resource node end scheduler corresponding to any target resource node in the cluster of resource nodes for performing the steps of the method of any of claims 1 to 7;
a central scheduler corresponding to said cluster of resource nodes for performing the steps of the method of any of claims 1 to 7.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 7 by executing the executable instructions.
11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 7.
CN202310411587.4A 2023-04-10 2023-04-10 Memory fault processing method, device, system, equipment and storage medium Active CN116126581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310411587.4A CN116126581B (en) 2023-04-10 2023-04-10 Memory fault processing method, device, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310411587.4A CN116126581B (en) 2023-04-10 2023-04-10 Memory fault processing method, device, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116126581A CN116126581A (en) 2023-05-16
CN116126581B true CN116126581B (en) 2023-09-01

Family

ID=86299487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310411587.4A Active CN116126581B (en) 2023-04-10 2023-04-10 Memory fault processing method, device, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116126581B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116332B (en) * 2023-09-07 2024-05-24 上海合芯数字科技有限公司 Multi-bit error processing method, device, server and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6123931B1 (en) * 2016-03-15 2017-05-10 日本電気株式会社 Information processing apparatus, information processing method, and program
WO2022028209A1 (en) * 2020-08-05 2022-02-10 华为技术有限公司 Memory failure processing method and apparatus
CN114691409A (en) * 2022-04-18 2022-07-01 阿里巴巴(中国)有限公司 Memory fault processing method and device
CN115168088A (en) * 2022-07-08 2022-10-11 超聚变数字技术有限公司 Method and device for repairing uncorrectable errors of memory
CN115421984A (en) * 2022-09-29 2022-12-02 深信服科技股份有限公司 Memory fault processing method and device, electronic equipment and medium
CN115421960A (en) * 2022-09-28 2022-12-02 深信服科技股份有限公司 UE memory fault recovery method, device, electronic equipment and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8201024B2 (en) * 2010-05-17 2012-06-12 Microsoft Corporation Managing memory faults

Also Published As

Publication number Publication date
CN116126581A (en) 2023-05-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant