CN115543666A - Method, apparatus and computer-readable storage medium for fault handling - Google Patents


Info

Publication number
CN115543666A
Authority
CN
China
Prior art keywords
fault
executed
firmware
information
sending
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110739085.5A
Other languages
Chinese (zh)
Inventor
张俊
仇连根
龚彬阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110739085.5A
Publication of CN115543666A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a fault handling method, a device, and a computer-readable storage medium, belonging to the technical field of RAS fault tolerance. The method includes: obtaining fault information through firmware and reporting the fault information to the OS; determining, by the OS, a fault handling policy to be executed according to the fault information; sending, by the OS to the firmware, an execution command corresponding to the fault handling policy to be executed; and executing, by the firmware, the corresponding fault handling policy according to the execution command. The firmware in the application is no longer limited to a single coded fault handling policy. Instead, a plurality of fault handling policies can be coded in the firmware, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy does not need to be adjusted by upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.

Description

Method, apparatus and computer readable storage medium for fault handling
Technical Field
The present application relates to the field of RAS fault tolerance technologies, and in particular, to a method, an apparatus, and a computer-readable storage medium for fault handling.
Background
A server processes data frequently, and as its running time increases, faults such as memory faults and processor faults inevitably occur. Therefore, an underlying fault tolerance technology (also called an underlying reliability, availability, and serviceability (RAS) technology) needs to be deployed on the server to handle such faults.
Currently, the underlying RAS technology is mainly implemented by the firmware of the server. Specifically, the implementation may include the following process: after a chip detects a fault, it reports an interrupt to the firmware. The firmware then collects the fault information; if it determines from the fault information that the fault is a memory fault, it executes the hard-coded memory fault handling policy, and if it determines that the fault is a processor fault, it executes the hard-coded processor fault handling policy.
In the underlying RAS technology, a corresponding fault handling policy is hard-coded in the firmware for memory faults, processor faults, and the like. In this case, replacing a fault handling policy requires upgrading the firmware in which the policy is hard-coded and restarting the server. However, restarting the server may interrupt the services running on it.
Disclosure of Invention
The embodiments of the application provide a method, a device, and a computer-readable storage medium for fault handling, which can solve the problem in the related art that services are interrupted because the server is restarted when a fault handling policy is adjusted. The technical solution is as follows:
In a first aspect, a method for fault handling is provided, where the method includes:
Fault information is obtained by firmware, and the obtained fault information is sent to an operating system (OS). The OS determines a fault handling policy to be executed according to the received fault information, and sends an execution command corresponding to the fault handling policy to be executed to the firmware. After receiving the execution command, the firmware executes the fault handling policy to be executed.
In the solution shown in the embodiments of the present application, the firmware is a basic input output system (BIOS). The fault information includes a fault location, a fault level, a fault type, and the like. For memory faults, the fault types include FIFO overflow, timeout, and the like.
After detecting a fault, the hardware writes the fault information into a designated CPU register. The BIOS obtains the fault information from the designated CPU register and reports it to the OS. The OS determines a fault handling policy to be executed according to the fault information, and then sends an execution command corresponding to that policy. After receiving the execution command, the BIOS sends an execution notification corresponding to the fault handling policy to be executed to the hardware. After receiving the execution notification, the hardware calls the code corresponding to the fault handling policy to be executed in an operator module of the BIOS, so as to handle the fault.
It can be seen that the firmware in the present application is no longer limited to a single coded fault handling policy. Instead, a plurality of fault handling policies can be coded in the firmware, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy does not need to be adjusted by upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.
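The division of labor described above can be illustrated with a minimal sketch. The class names, the policy names, and the decision rule are hypothetical and chosen only for illustration; the application does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class FaultInfo:
    location: str    # e.g. a DIMM or core identifier (hypothetical format)
    level: str       # e.g. "corrected" or "uncorrected"
    fault_type: str  # e.g. "timeout"


class Firmware:
    """Holds several pre-coded policies; the OS chooses which one runs."""

    def __init__(self):
        self.executed = []
        # Hypothetical policy names mapped to their handlers.
        self.policies = {
            "rank_replacement": lambda info: f"rank replaced at {info.location}",
            "ppr": lambda info: f"row repaired at {info.location}",
        }

    def execute(self, policy_name, info):
        result = self.policies[policy_name](info)
        self.executed.append(result)
        return result


class OperatingSystem:
    def decide(self, info):
        # Placeholder rule, assumed for illustration only.
        return "ppr" if info.fault_type == "timeout" else "rank_replacement"


def handle_fault(firmware, os_, info):
    policy = os_.decide(info)              # the OS selects the policy
    return firmware.execute(policy, info)  # the firmware carries it out
```

In this sketch, changing the rule in `OperatingSystem.decide` alters which policy runs without touching the `Firmware` class, which mirrors the stated benefit of not having to upgrade the firmware.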
In a possible implementation, the firmware sends the fault information to an Advanced Configuration and Power Interface (ACPI) virtual device driver in the OS. Specifically, the firmware sends the fault information to the ACPI virtual device driver in the OS through the ACPI Platform Error Interface (APEI).
In the solution shown in the embodiments of the present application, a fault reporting module in the BIOS calls the APEI to report the fault information to a fault notification chain registered in the kernel space of the OS. The fault notification chain then notifies the ACPI virtual device driver in the kernel space of the OS to obtain the fault information.
In a possible implementation, before the OS determines the fault mode and the probability that the fault will cause an uncorrectable error according to the fault information, the fault information may be obtained as follows:
The ACPI virtual device driver stores the fault information in a target memory, and a device node controller in the OS queries the target memory at a preset period to obtain the fault information.
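The store-then-poll arrangement can be sketched as follows. This is a user-level simulation under stated assumptions: `TargetMemory`, `poll`, and the record format are hypothetical names, and a real target memory would be a region shared between kernel and user space rather than a Python object.

```python
from collections import deque


class TargetMemory:
    """Stand-in for the memory area the driver writes fault records to."""

    def __init__(self):
        self._records = deque()

    def store(self, record):
        # Called by the (simulated) ACPI virtual device driver.
        self._records.append(record)

    def drain(self):
        # Return and clear all pending records.
        records = list(self._records)
        self._records.clear()
        return records


def poll(memory, cycles, on_fault, sleep=lambda seconds: None, period_s=1.0):
    """Query the target memory once per period for a number of cycles."""
    for _ in range(cycles):
        for record in memory.drain():
            on_fault(record)
        sleep(period_s)
```

Passing `time.sleep` as the `sleep` argument would make the loop wait for the preset period between queries; the default no-op keeps the sketch fast to run.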
In a possible implementation, the OS determining the fault handling policy to be executed according to the fault information and sending the corresponding execution command to the firmware may specifically be:
The device node controller in the OS determines the fault handling policy to be executed according to the fault information, and sends an execution command corresponding to the fault handling policy to be executed to the firmware.
In a possible implementation, the device node controller determining the fault handling policy to be executed according to the fault information may specifically be:
The device node controller determines, according to the fault information, a fault mode and the probability that the fault will cause an uncorrectable error, and determines the fault handling policy to be executed according to the fault mode and the probability.
In the solution shown in the embodiments of the present application, the device node controller may include a collector, a diagnoser, and a decision maker.
After the device node controller queries the target memory at the preset period and obtains the fault information, the specific processing is as follows: the diagnoser in the device node controller determines the fault mode and the probability that the fault will cause an uncorrectable error according to an intelligent diagnosis model and the fault information. The intelligent diagnosis model may be constructed with machine learning algorithms such as a threshold classification algorithm and a forest tree algorithm. Before use, the intelligent diagnosis model may be trained in advance on a large number of samples, where each group of samples may include fault information, the fault mode corresponding to the fault information, and the probability of an uncorrectable error corresponding to the fault information.
Taking memory fault information as an example, the fault mode may include a row fault, a column fault, and the like.
The device node controller determining the fault handling policy to be executed according to the fault mode and the probability may specifically be: the diagnoser sends the obtained fault mode and probability of an uncorrectable error to the decision maker, and the decision maker determines the fault handling policy to be executed according to them.
The decision maker may determine the fault handling policy to be executed that corresponds to the currently obtained fault mode and probability according to a pre-stored correspondence among fault modes, probabilities of an uncorrectable error, and fault handling policies.
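A pre-stored correspondence of this kind could be represented as an ordered table, as in the following sketch. The table entries, the probability bounds, and the policy identifiers are all hypothetical; the application does not list concrete values.

```python
# Hypothetical correspondence table: (fault mode, probability lower bound, policy).
# Entries are checked in order, so higher bounds come first for a given mode.
POLICY_TABLE = [
    ("row", 0.8, "ppr"),
    ("row", 0.0, "adjust_polling_period"),
    ("column", 0.8, "bank_replacement"),
    ("column", 0.0, "interrupt_storm_suppression"),
]


def select_policy(fault_mode, uncorrectable_probability):
    """Return the first policy whose mode matches and whose bound is met."""
    for mode, lower_bound, policy in POLICY_TABLE:
        if mode == fault_mode and uncorrectable_probability >= lower_bound:
            return policy
    return None  # no stored correspondence for this fault mode
```

Because the table is plain data on the OS side, it could be edited or reloaded at run time without any change to the firmware.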
Taking memory fault information as an example, the fault handling policies may include: setting storm suppression for memory fault interrupts, setting the memory polling period, performing mirror (Mirror) replacement, performing memory rank replacement, performing memory bank replacement, performing memory chip replacement, ACLS (ARM Cache Line Sparing, a method for repairing hard failures of memory cells), PPR, and the like.
In a possible implementation, the OS sending the execution command corresponding to the fault handling policy to be executed to the firmware may specifically be:
The OS sends the execution command corresponding to the fault handling policy to be executed to the ACPI virtual device.
In a possible implementation, the OS sending the execution command corresponding to the fault handling policy to be executed to the ACPI virtual device may specifically be:
The decision maker calls a target interface that is encapsulated in the ACPI virtual device driver and corresponds to the fault handling policy to be executed, and sends the execution command corresponding to the fault handling policy to be executed to the ACPI virtual device through the target interface.
In the solution shown in the embodiments of the present application, the target interface is one of the ACPI DSM interfaces encapsulated in the ACPI virtual device driver. Each fault handling policy for memory faults corresponds to an ACPI DSM interface.
In a second aspect, a fault handling apparatus is provided, which includes a processor and a memory, where a plurality of programs corresponding to an OS and firmware are stored, and the programs are read and executed by the processor to implement the fault handling method according to the first aspect.
In a third aspect, an apparatus for fault handling is provided, the apparatus comprising:
an obtaining module, configured to obtain fault information through firmware and send the fault information to an operating system (OS);
a sending module, configured to determine, through the OS, a fault handling policy to be executed according to the fault information, and send an execution command corresponding to the fault handling policy to be executed to the firmware; and
an execution module, configured to execute the fault handling policy to be executed through the firmware.
In a possible implementation, the obtaining module is configured to:
call the ACPI Platform Error Interface (APEI) through the firmware to send the fault information to an ACPI virtual device driver in the OS.
In a possible implementation, the obtaining module is further configured to:
record the fault information in a target memory through the ACPI virtual device driver; and
query the target memory at a preset period through the device node controller in the OS to obtain the fault information.
In a possible implementation, the sending module is configured to:
determine, through the device node controller, the fault handling policy to be executed according to the fault information, and send an execution command corresponding to the fault handling policy to be executed to the firmware.
In a possible implementation, the sending module is configured to:
determine, through the device node controller, a fault mode and the probability that the fault will cause an uncorrectable error according to the fault information, and determine the fault handling policy to be executed according to the fault mode and the probability.
In a possible implementation, the sending module is configured to:
send the execution command corresponding to the fault handling policy to be executed to the ACPI virtual device.
In a possible implementation, the sending module is configured to:
call a target interface that is encapsulated in the ACPI virtual device driver and corresponds to the fault handling policy to be executed, and send the execution command corresponding to the fault handling policy to be executed to the ACPI virtual device through the target interface.
In a possible implementation, the firmware is a basic input output system (BIOS).
In a fourth aspect, a computer-readable storage medium is provided, in which a plurality of programs respectively corresponding to an OS and firmware are stored, and the programs are configured to be read and executed by a processor to implement the method for fault handling according to the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
In the embodiments of the application, after obtaining the fault information, the firmware reports the fault information to the OS. The OS determines a fault handling policy to be executed according to the fault information, and then sends an execution command corresponding to the fault handling policy to be executed to the firmware. Finally, the firmware executes the corresponding fault handling policy according to the execution command. It can be seen that the firmware in the present application is no longer limited to a single coded fault handling policy. Instead, a plurality of fault handling policies can be coded in the firmware, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy does not need to be adjusted by upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.
Drawings
Fig. 1 is a schematic architecture diagram of a server provided in an embodiment of the present application;
Fig. 2 is a diagram comparing fault handling method architectures provided in an embodiment of the present application;
Fig. 3 is a diagram of a fault handling architecture provided by an embodiment of the present application;
Fig. 4 is a flowchart of a method for fault handling according to an embodiment of the present application;
Fig. 5 is a diagram of a fault handling architecture provided by an embodiment of the present application;
Fig. 6 is a flowchart of a method for fault handling according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a fault processing method which can be applied to a server, a storage system, a computer and the like. In the method, an Operating System (OS) implements a selection decision of a fault handling policy, and instructs a firmware to execute the fault handling policy determined by the selection decision, thereby completing the underlying fault-tolerant handling of the fault.
Referring to fig. 1, an architecture diagram of a server provided in an embodiment of the present application is shown.
The server shown in fig. 1 includes a processor 110, a memory 120, a bridge 130, a storage Controller 140, a hard disk 150, a flash memory 160, a network card 170, a graphics card 180, and a Baseboard Management Controller (BMC) 190. Among other things, the processor 110 may detect server failures, such as memory failures, processor failures, and the like. The flash memory 160 may store a BIOS. The hard disk 150 may store an OS. The memory 120 may store failure information.
The processor 110 extends various interfaces through the bridge chip 130. For example, the flash memory is connected through a Serial Peripheral Interface (SPI) of the bridge chip 130. The BMC is connected through a Peripheral Component Interconnect Express (PCIe) interface, an asynchronous serial port, and the like extended by the bridge chip. The network card is connected through a PCIe interface extended by the bridge chip 130. The BMC is connected to the management network port, and the network card is connected to the service network port. In addition, the processor 110 may provide a Universal Serial Bus (USB) interface through the bridge chip.
To clarify the difference between the fault handling method provided by the embodiments of the present application and the underlying fault tolerance method in the related art, the two methods are described separately below with reference to fig. 2.
The left diagram in fig. 2 shows the underlying fault tolerance method in the related art. In the left diagram, the firmware collects the fault information and then executes the designated fault handling policy hard-coded in the firmware to achieve underlying fault tolerance. The firmware may also report the fault information to the OS, and the OS performs software-level soft handling of the fault based on the fault information.
The right diagram in fig. 2 shows the fault handling method provided by the embodiments of the present application. In the right diagram, the firmware collects the fault information and then reports it to the OS. After obtaining the fault information, the OS decides on a corresponding fault handling policy according to the fault information and instructs the firmware to execute it. In addition, after obtaining the fault information, the OS may also perform software-level soft handling of the fault according to the fault information.
The following briefly describes a fault handling method provided in the embodiment of the present application with reference to a fault handling architecture diagram shown in fig. 3.
After detecting a fault, the hardware reports fault information to the firmware. Specifically, the fault may be a memory fault, a processor fault, or the like. The hardware may be a processor of an X86 architecture or of an advanced RISC machine (ARM) architecture.
The firmware reports the fault information to the OS through the APEI, and an adaptive RAS management layer (adaptive RAS organization) in the user space of the OS obtains the fault information. "Adaptive RAS management layer" is only an exemplary name; the layer may also be called a fault management module or the like. It is essentially a software module, and its specific name is not limited in this embodiment.
Then, the adaptive RAS management layer determines a fault handling policy according to the fault information, and sends an execution command corresponding to the determined fault handling policy to a RAS node driver in the kernel space of the OS. The RAS node driver issues the execution command to a firmware node (FW node) in the firmware through an adaptation driver. The adaptation driver relays the execution command, so that the relayed execution command can be received and parsed by the firmware node.
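The relaying role of the adaptation driver can be pictured as an encode step on the OS side and a parse step on the firmware side. The following sketch assumes a made-up binary layout; the function names and the `<HI` format are illustrative only and do not come from the application.

```python
import struct

# Hypothetical wire format: 2-byte policy id, 4-byte argument, little-endian.
COMMAND_FORMAT = "<HI"


def adaptation_driver_encode(policy_id, argument):
    """Convert the OS-side command into bytes the firmware node can parse."""
    return struct.pack(COMMAND_FORMAT, policy_id, argument)


def firmware_node_decode(payload):
    """Parse the relayed command on the firmware side."""
    policy_id, argument = struct.unpack(COMMAND_FORMAT, payload)
    return {"policy_id": policy_id, "argument": argument}


def ras_node_driver_send(policy_id, argument):
    # The RAS node driver hands the command to the adaptation driver,
    # which relays it to the firmware node.
    return firmware_node_decode(adaptation_driver_encode(policy_id, argument))
```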
The firmware node receives the execution command and instructs the hardware to call the corresponding code in an FW RAS core to execute the fault handling policy.
It should be noted that "RAS node driver", "adaptation driver", "firmware node", and "FW RAS core" are all exemplary names; these components are essentially software modules and may also have other names.
Fig. 4 shows a flowchart of a method for processing a fault according to an embodiment of the present application. Referring to fig. 4, the method may include the steps of:
Step 301, the firmware obtains the fault information.
The firmware is a basic input output system (BIOS). The fault information includes a fault location, a fault level, a fault type, and the like. For memory faults, the fault types include FIFO overflow, timeout, and the like.
In implementation, when a central processing unit (CPU) detects a fault in the memory, the CPU, or the like, the CPU generates fault information and writes it into a designated CPU register. The CPU then sends an interrupt signal to the firmware. After receiving the interrupt signal sent by the CPU, the firmware obtains the fault information from the designated CPU register.
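As an illustration of how fault location, level, and type might be packed into and read back from a register, consider the sketch below. The 32-bit layout and the type codes are assumptions made for this example; the application does not define a register format.

```python
# Hypothetical 32-bit register layout, assumed only for this sketch:
#   bits 0-7   fault type code
#   bits 8-11  fault level
#   bits 12-31 fault location
FAULT_TYPES = {1: "fifo_overflow", 2: "timeout"}


def encode_fault(location, level, type_code):
    """What the CPU would write to the designated register."""
    return (location << 12) | (level << 8) | type_code


def on_interrupt(register_value):
    """What the firmware would do after receiving the interrupt signal."""
    return {
        "location": register_value >> 12,
        "level": (register_value >> 8) & 0xF,
        "type": FAULT_TYPES.get(register_value & 0xFF, "unknown"),
    }
```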
In particular, see the fault handling architecture diagram shown in fig. 5. The processing of the firmware in acquiring the failure information may be as follows:
an error report (error report) module in the firmware obtains the fault information from the specified CPU registers.
It should be noted that the CPU belongs to a part of the hardware in fig. 5, and specifically, the CPU may be a processor in an X86 architecture or a processor in an ARM architecture.
Step 302, the firmware sends failure information to the OS.
In an implementation, the firmware may report the failure information to the OS after acquiring the failure information.
In particular, see the fault handling architecture diagram shown in fig. 5. The process of reporting the fault information to the OS by the firmware may be as follows:
The fault reporting module calls the APEI to report the fault information to a fault notification chain (that is, the APEI driver in fig. 5) registered in the kernel space of the OS.
Then, the fault notification chain notifies the ACPI virtual device driver in the kernel space of the OS to obtain the fault information.
The ACPI virtual device driver obtains the fault information and records it in a specified memory. A device node controller in the user space of the OS polls the specified memory to obtain the fault information.
The device node controller may include a collector, a diagnoser, and a decision maker. Specifically, the collector polls the specified memory to obtain the fault information.
Note that the ACPI virtual device driver in fig. 5 corresponds to the combination of the RAS node driver and the adaptation driver in fig. 3, and the device node controller in fig. 5 corresponds to the adaptive RAS management layer (adaptive RAS organization) in fig. 3.
The ACPI virtual device driver and the fault notification chain are described below:
Before step 301 is executed, an ACPI virtual device may be integrated into the firmware and reported to the OS. The kernel space of the OS may then create an ACPI virtual device driver for the ACPI virtual device, and the ACPI virtual device driver registers a fault notification chain.
The ACPI virtual device driver has at least the following functions:
registering the fault notification chain, encapsulating the ACPI Device Specific Method (ACPI DSM) interfaces, and providing those interfaces to the device node controller.
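The registration and notification relationship can be modeled with a simple callback list. This is a user-level analogy under stated assumptions, not kernel code: `NotificationChain` and the method names are hypothetical stand-ins for the kernel mechanism.

```python
class NotificationChain:
    """Simplified stand-in for a kernel notification chain."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def notify(self, fault_info):
        # Invoke every registered callback in registration order.
        for callback in self._callbacks:
            callback(fault_info)


class AcpiVirtualDeviceDriver:
    def __init__(self, chain):
        self.records = []              # fault records kept by the driver
        chain.register(self.on_fault)  # register with the chain at creation

    def on_fault(self, fault_info):
        self.records.append(fault_info)
```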
Step 303, the OS determines a to-be-executed fault handling policy according to the fault information.
In implementation, the collector in the device node controller sends the fault information to the diagnoser, and the diagnoser determines the corresponding fault mode and the probability that the fault will cause an uncorrectable error according to an intelligent diagnosis model.
Taking memory fault information as an example, the fault mode may include a row fault, a column fault, and the like.
Taking CPU fault information as an example, the fault mode may include an in-core cache failure, a logic execution unit fault, and the like.
It should be noted that the intelligent diagnosis model may be constructed with machine learning algorithms such as a threshold classification algorithm and a forest tree algorithm. Before use, the intelligent diagnosis model may be trained in advance on a large number of samples, where each group of samples may include fault information, the fault mode corresponding to the fault information, and the probability of an uncorrectable error corresponding to the fault information.
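To make the idea of threshold-based classification concrete, the following toy diagnoser infers a fault mode and a probability from a list of error addresses. It is a hand-written rule, not a trained model; the input format, the `GRADES` thresholds, and the probability values are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical grading thresholds: (minimum error count, probability estimate).
GRADES = [(10, 0.9), (5, 0.5), (0, 0.1)]


def diagnose(error_addresses):
    """Classify (row, column) error addresses into a fault mode and a probability.

    The rule is a toy threshold classifier, not a trained model.
    """
    rows = Counter(row for row, _ in error_addresses)
    cols = Counter(col for _, col in error_addresses)
    top_row = max(rows.values(), default=0)
    top_col = max(cols.values(), default=0)
    # Errors concentrated in one row suggest a row fault, and likewise for columns.
    mode = "row" if top_row >= top_col else "column"
    count = max(top_row, top_col)
    probability = next(p for minimum, p in GRADES if count >= minimum)
    return mode, probability
```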
The diagnoser sends the obtained fault mode and probability of an uncorrectable error to the decision maker, and the decision maker determines the fault handling policy to be executed according to them.
For example, the decision maker may determine the fault handling policy to be executed that corresponds to the currently obtained fault mode and probability according to a pre-stored correspondence among fault modes, probabilities of an uncorrectable error, and fault handling policies. Taking a memory fault as an example, when the fault mode is a row fault and the probability that the fault will cause an uncorrectable error is greater than a preset threshold, the fault handling policy to be executed is determined to be PPR (Post Package Repair), which is a method for repairing memory row errors.
Taking memory fault information as an example, the fault handling policies may include: setting storm suppression for memory fault interrupts, setting the memory polling period, performing mirror (Mirror) replacement, performing memory rank replacement, performing memory bank replacement, performing memory chip replacement, ACLS (ARM Cache Line Sparing, a method for repairing hard failures of memory cells), PPR, and the like.
Step 304, the OS sends an execution command corresponding to the to-be-executed fault handling policy to the firmware.
In implementation, the decision maker calls the target interface that is encapsulated in the ACPI virtual device driver and corresponds to the fault handling policy to be executed, and sends the execution command to the ACPI virtual device through the target interface. The target interface is one of the ACPI DSM interfaces encapsulated in the ACPI virtual device driver.
As shown in table 1 below, each fault handling policy for memory faults corresponds to an ACPI DSM interface.
TABLE 1
Figure BDA0003142488690000071
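The dispatch in step 304 can be sketched as a mapping from policy name to interface index. The indices below are hypothetical placeholders, since the actual correspondence is given in Table 1; the class `AcpiVirtualDeviceDriver` and its method are likewise illustrative stand-ins, not a real driver API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical function indices; the real policy-to-interface mapping is in Table 1.
TARGET_INTERFACE: Dict[str, int] = {
    "PPR": 1,
    "ACLS": 2,
}


class AcpiVirtualDeviceDriver:
    """Stand-in for the OS-side driver that wraps the target interfaces."""

    def __init__(self, send_to_device: Callable[[int, dict], None]) -> None:
        self._send = send_to_device

    def call_target_interface(self, policy: str, args: dict) -> int:
        """Look up the interface for a policy and forward an execution command."""
        if policy not in TARGET_INTERFACE:
            raise ValueError(f"no target interface for policy {policy!r}")
        index = TARGET_INTERFACE[policy]
        self._send(index, args)
        return index


# Usage: the "device" here is just a list that records what it receives.
received: List[Tuple[int, dict]] = []
driver = AcpiVirtualDeviceDriver(lambda idx, args: received.append((idx, args)))
driver.call_target_interface("PPR", {"row": 42})
```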
Step 305, the firmware executes the to-be-executed fault handling policy.
In implementation, after receiving the execution command corresponding to the to-be-executed fault handling policy, the ACPI virtual device sends an execution notification corresponding to that policy to the CPU. After receiving the execution notification, the CPU calls the code corresponding to the to-be-executed fault handling policy in the operator module of the firmware to carry out the fault handling.
Specifically, if the CPU is a processor of an X86 architecture, the ACPI virtual device may notify the execution of the to-be-executed fault handling policy through a System Management Interrupt (SMI).
If the CPU is an ARM architecture processor, the ACPI virtual device may notify the execution of the fault handling policy to be executed through a Serial Peripheral Interface (SPI) or a System Control and Management Interface (SCMI).
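The architecture-dependent choice of notification path can be written as a simple selector. This is a sketch only: the function name and the string labels are assumptions, and the function merely returns the mechanisms named in the text rather than raising any real interrupt.

```python
from typing import List


def notification_mechanisms(cpu_architecture: str) -> List[str]:
    """Return the notification paths the text names for a CPU architecture."""
    arch = cpu_architecture.strip().lower()
    if arch == "x86":
        return ["SMI"]            # System Management Interrupt
    if arch == "arm":
        return ["SPI", "SCMI"]    # either path may be used
    raise ValueError(f"unsupported architecture: {cpu_architecture!r}")
```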
In the embodiment of the application, after acquiring the fault information, the firmware reports the fault information to the OS. The OS determines the fault handling policy to be executed according to the fault information and sends the corresponding execution command to the firmware. Finally, the firmware executes the to-be-executed fault handling policy according to the execution command. The firmware is therefore no longer limited to a single hard-coded fault handling policy: multiple fault handling policies can be coded into the firmware, and the OS decides which one to use and instructs the firmware to execute it. As a result, adjusting the fault handling policy no longer requires a firmware upgrade, which avoids the service interruption caused by restarting the server when the policy is adjusted.
The following describes a processing flow of the method for fault handling in the fault handling architecture shown in fig. 5, with reference to a flow chart of the method for fault handling shown in fig. 6.
Step 501, the hardware detects the fault, acquires fault information, and writes the fault information into a designated CPU register.
Step 502, the hardware sends an interrupt signal to the firmware.
In an implementation, the hardware may send an interrupt signal to the firmware after writing the fault information to the designated CPU register.
Step 503, the firmware reads the fault information in the designated CPU register.
In implementation, after receiving an interrupt signal sent by hardware, the firmware reads fault information in a designated CPU register.
Step 504, the firmware calls the APEI to send the fault information to the APEI driver.
In implementation, after reading the fault information in the designated CPU register, the firmware calls the APEI to send the fault information to the APEI driver.
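Steps 501 to 504 can be modelled as a register write followed by an interrupt-driven read. The sketch below is a plain software simulation under stated assumptions: `FaultRegister`, `Hardware`, `Firmware`, and the `apei_send` hook are hypothetical names, and clearing the register after reading is an assumption not stated in the text.

```python
from typing import Callable, List, Optional


class FaultRegister:
    """Stand-in for the designated CPU register."""

    def __init__(self) -> None:
        self.value: Optional[dict] = None


class Firmware:
    def __init__(self, register: FaultRegister, apei_send: Callable[[dict], None]) -> None:
        self._register = register
        self._apei_send = apei_send  # hypothetical hook standing in for the APEI call

    def on_interrupt(self) -> bool:
        """Steps 503-504: read the register and forward its content via APEI."""
        fault_info = self._register.value
        if fault_info is None:
            return False  # nothing was written, so nothing to report
        self._register.value = None  # assumed: clear after reading
        self._apei_send(fault_info)
        return True


class Hardware:
    def __init__(self, register: FaultRegister, raise_interrupt: Callable[[], bool]) -> None:
        self._register = register
        self._raise_interrupt = raise_interrupt

    def report_fault(self, fault_info: dict) -> None:
        self._register.value = fault_info  # step 501: write the register
        self._raise_interrupt()            # step 502: signal the firmware


apei_driver_inbox: List[dict] = []
register = FaultRegister()
firmware = Firmware(register, apei_driver_inbox.append)
hardware = Hardware(register, firmware.on_interrupt)
hardware.report_fault({"type": "memory", "row": 42})
```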
Step 505, the APEI driver notifies the ACPI virtual device driver to acquire the fault information.
In implementation, after receiving the fault information, the APEI driver notifies the ACPI virtual device driver to acquire the fault information.
Step 506, the ACPI virtual device driver records the fault information in the designated memory.
In implementation, after acquiring the fault information, the ACPI virtual device driver writes the fault information into the specified memory.
Step 507, the collector in the device node controller polls the designated memory to acquire the fault information.
In implementation, the collector in the device node controller queries the designated memory at a preset period, so once fault information has been written into the designated memory, the collector acquires it in a subsequent polling cycle.
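The write-then-poll handoff in steps 506 and 507 can be sketched as a queue that the driver fills and the collector drains on a fixed period. The names `DesignatedMemory` and `poll`, the bounded cycle count, and the injectable `sleep` parameter are assumptions made so that the example terminates and can be exercised without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class DesignatedMemory:
    """Stand-in for the designated memory shared by driver and collector."""

    def __init__(self) -> None:
        self._records: Deque[dict] = deque()

    def write(self, fault_info: dict) -> None:
        """Step 506: the ACPI virtual device driver records fault information."""
        self._records.append(fault_info)

    def drain(self) -> List[dict]:
        """Return and remove everything written since the last poll."""
        records = list(self._records)
        self._records.clear()
        return records


def poll(
    memory: DesignatedMemory,
    period_seconds: float,
    cycles: int,
    on_fault: Callable[[dict], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Step 507: query the memory once per period; return how many records were delivered."""
    delivered = 0
    for _ in range(cycles):
        for record in memory.drain():
            on_fault(record)
            delivered += 1
        sleep(period_seconds)
    return delivered
```

A real collector would presumably run until shut down rather than for a fixed number of cycles; the bound here only keeps the sketch finite.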
Step 508, the collector sends the fault information to the diagnoser in the device node controller.
In implementation, after acquiring the fault information, the collector sends it to the diagnoser in the device node controller.
Step 509, the diagnoser determines the corresponding fault mode and the probability that the fault produces an uncorrectable error according to the fault information.
In implementation, after obtaining the fault information, the diagnoser inputs it into the intelligent diagnosis model, which outputs the corresponding fault mode and the probability that the fault produces an uncorrectable error.
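The diagnoser's contract, fault information in and a (fault mode, probability) pair out, can be sketched with a pluggable model interface. The text does not describe the intelligent diagnosis model itself, so `ToyRowModel` below is a deliberately crude placeholder heuristic and not the patent's model; the `row` field and the mode labels other than a row fault are also assumptions.

```python
from typing import List, Protocol, Tuple


class DiagnosisModel(Protocol):
    def predict(self, fault_info: List[dict]) -> Tuple[str, float]:
        """Return (fault mode, probability of an uncorrectable error)."""
        ...


class ToyRowModel:
    """Placeholder heuristic: reports a ROW fault when errors concentrate in one row."""

    def predict(self, fault_info: List[dict]) -> Tuple[str, float]:
        if not fault_info:
            return ("NONE", 0.0)
        rows = [event["row"] for event in fault_info]
        most_common = max(set(rows), key=rows.count)
        share = rows.count(most_common) / len(rows)
        # The share of errors in the busiest row doubles as a stand-in probability.
        mode = "ROW" if share > 0.5 else "SCATTERED"
        return (mode, share)


def diagnose(model: DiagnosisModel, fault_info: List[dict]) -> Tuple[str, float]:
    """Step 509: run the model and sanity-check the probability it returns."""
    mode, probability = model.predict(fault_info)
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    return mode, probability
```

Separating the interface from the model keeps the decision maker independent of how the probability is produced, so a trained model could replace the placeholder without changing its callers.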
Step 510, the diagnoser sends the fault mode and the probability that the fault produces an uncorrectable error to the decision maker in the device node controller.
In implementation, after obtaining the fault mode and the probability of an uncorrectable error, the diagnoser sends them to the decision maker in the device node controller.
Step 511, the decision maker determines the fault handling policy to be executed according to the fault mode and the probability that the fault produces an uncorrectable error.
In implementation, the decision maker determines the to-be-executed fault handling policy that corresponds to the currently obtained fault mode and probability, according to the pre-stored correspondence among fault modes, probabilities of an uncorrectable error, and fault handling policies.
Step 512, the decision maker calls a target interface corresponding to the to-be-executed fault handling policy packaged in the ACPI virtual device driver.
In implementation, after determining the to-be-executed fault processing policy, the decision maker calls a target interface corresponding to the to-be-executed fault processing policy packaged in the ACPI virtual device driver.
Step 513, the ACPI virtual device driver sends an execution command to the ACPI virtual device through the target interface.
Step 514, the ACPI virtual device sends an execution notification corresponding to the to-be-executed fault handling policy to the hardware.
In implementation, if the hardware is a processor of the X86 architecture, the ACPI virtual device may issue the execution notification for the to-be-executed fault handling policy through an SMI.
If the hardware is a processor of the ARM architecture, the ACPI virtual device may issue the execution notification for the to-be-executed fault handling policy through the SPI or SCMI.
Step 515, the hardware calls a code corresponding to the to-be-executed fault handling policy in the operator module of the firmware, so as to implement fault handling.
It should be noted that the specific processing of each module in steps 501 to 515 is the same as the specific processing of the corresponding module in steps 301 to 305, and is not described herein again.
In the embodiment of the application, after acquiring the fault information, the firmware reports the fault information to the OS. The OS determines the fault handling policy to be executed according to the fault information and sends the corresponding execution command to the firmware. Finally, the firmware executes the to-be-executed fault handling policy according to the execution command. The firmware is therefore no longer limited to a single hard-coded fault handling policy: multiple fault handling policies can be coded into the firmware, and the OS decides which one to use and instructs the firmware to execute it. As a result, adjusting the fault handling policy no longer requires a firmware upgrade, which avoids the service interruption caused by restarting the server when the policy is adjusted.
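The division of labour summarised above can be illustrated with a minimal model in which the firmware carries several coded policies and the OS owns the decision. All class names, the `decide` callback, and the two policy handlers are hypothetical; the point of the sketch is only that swapping the OS-side decision changes which policy runs without modifying the firmware object.

```python
from typing import Callable, Dict, List


class Firmware:
    """Holds several coded policies; which one runs is decided elsewhere."""

    def __init__(self) -> None:
        self.executed: List[str] = []
        # Hypothetical handlers standing in for the firmware's operator module.
        self._operators: Dict[str, Callable[[], None]] = {
            "PPR": lambda: self.executed.append("PPR"),
            "ACLS": lambda: self.executed.append("ACLS"),
        }

    def execute(self, policy: str) -> None:
        if policy not in self._operators:
            raise ValueError(f"policy not coded in firmware: {policy!r}")
        self._operators[policy]()


class OperatingSystem:
    def __init__(self, firmware: Firmware, decide: Callable[[dict], str]) -> None:
        self.firmware = firmware
        self.decide = decide  # OS-side decision logic, replaceable at run time

    def on_fault_info(self, fault_info: dict) -> str:
        policy = self.decide(fault_info)
        self.firmware.execute(policy)
        return policy


firmware = Firmware()
os_side = OperatingSystem(firmware, decide=lambda info: "PPR")
os_side.on_fault_info({"type": "memory"})

# Adjusting the policy: only the OS-side decision changes, the firmware is untouched.
os_side.decide = lambda info: "ACLS"
os_side.on_fault_info({"type": "memory"})
```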
Based on the same technical concept, an embodiment of the present application further provides a fault handling apparatus, as shown in fig. 7, the apparatus includes:
an obtaining module 710, configured to obtain fault information through firmware and send the fault information to an operating system (OS), which may specifically implement the obtaining and sending functions in step 301 and step 302 above, as well as other implicit steps;
a sending module 720, configured to determine, by the OS according to the fault information, a to-be-executed fault handling policy, and send, to the firmware, an execution command corresponding to the to-be-executed fault handling policy, which may specifically implement the determining and sending functions in step 303 and step 304 above, as well as other implicit steps;
an executing module 730, configured to execute the to-be-executed fault handling policy through the firmware, which may specifically implement the executing function in step 305 above, as well as other implicit steps.
In a possible implementation manner, the obtaining module 710 is configured to:
call the ACPI Platform Error Interface (APEI) through the firmware to send the fault information to an ACPI virtual device driver in the OS.
In a possible implementation manner, the obtaining module 710 is further configured to:
record the fault information in a target memory through the ACPI virtual device driver;
query the target memory, through the device node controller in the OS, at a preset period to acquire the fault information.
In a possible implementation manner, the sending module 720 is configured to:
determine, by the device node controller, the to-be-executed fault handling policy according to the fault information, and send the execution command corresponding to the to-be-executed fault handling policy to the firmware.
In a possible implementation manner, the sending module 720 is configured to:
determine, by the device node controller, a fault mode and a probability that the fault produces an uncorrectable error according to the fault information, and determine the to-be-executed fault handling policy according to the fault mode and the probability.
In a possible implementation manner, the sending module 720 is configured to:
send the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.
In a possible implementation manner, the sending module 720 is configured to:
call the target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and send the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.
In one possible implementation, the firmware is a basic input output system BIOS.
In the embodiment of the application, after acquiring the fault information, the firmware reports the fault information to the OS. The OS determines the fault handling policy to be executed according to the fault information and sends the corresponding execution command to the firmware. Finally, the firmware executes the to-be-executed fault handling policy according to the execution command. The firmware is therefore no longer limited to a single hard-coded fault handling policy: multiple fault handling policies can be coded into the firmware, and the OS decides which one to use and instructs the firmware to execute it. As a result, adjusting the fault handling policy no longer requires a firmware upgrade, which avoids the service interruption caused by restarting the server when the policy is adjusted.
It should be noted that: in the apparatus for processing a fault provided in the foregoing embodiment, only the division of each functional module is illustrated when performing fault processing, and in practical applications, the above function allocation may be completed by different functional modules as needed, that is, the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the apparatus for fault handling and the method for fault handling provided by the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and will not be described herein again.
Referring to fig. 8, an embodiment of the application provides a schematic diagram of an apparatus 600. The device 600 may be a computer, server, etc. The device 600 comprises at least a processor 601, an internal connection 602, a memory 603.
In a possible implementation manner, the processor 601 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program according to the present disclosure.
The internal connections 602 may include a path for passing information between the components. Optionally, the internal connection 602 is a single board or a bus.
The memory 603 may be, but is not limited to, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and coupled to the processor via a bus. The memory may also be integrated with the processor.
The memory 603 is used for storing program codes for executing the scheme of the application, and the processor 601 controls the execution. The processor 601 is configured to execute application program code stored in the memory 603, thereby causing the apparatus 600 to implement the functions of the present application.
In particular implementations, processor 601 may include one or more CPUs, such as CPU0 and CPU1 in fig. 8, as one embodiment.
In particular implementations, the device 600 may include multiple processors, as one embodiment. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When realized by software, it may be realized in whole or in part in the form of a computer program product. The computer program product comprises one or more computer program instructions which, when loaded and executed on a device, cause the processes or functions according to the embodiments of the application to be performed in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by the device, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid-state drive).
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only an example of the present invention and should not be taken as limiting the present invention, and any modifications, equivalents, improvements and the like made within the principles of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method of fault handling, the method comprising:
acquiring fault information through firmware and sending the fault information to an Operating System (OS);
determining, by the OS, a to-be-executed fault handling policy according to the fault information, and sending an execution command corresponding to the to-be-executed fault handling policy to the firmware;
executing the to-be-executed fault handling policy through the firmware.
2. The method of claim 1, wherein sending the fault information to an ACPI virtual device driver in the OS comprises:
calling the ACPI Platform Error Interface (APEI) through the firmware to send the fault information to an ACPI virtual device driver in the OS.
3. The method of claim 2, wherein prior to determining, by the OS, the fault mode and the probability that the fault produces an uncorrectable error based on the fault information, the method further comprises:
recording the fault information in a target memory through the ACPI virtual device driver;
querying the target memory, by the device node controller in the OS, at a preset period to acquire the fault information.
4. The method according to claim 3, wherein the determining, by the OS, a to-be-executed fault handling policy according to the fault information and sending an execution command corresponding to the to-be-executed fault handling policy to the firmware comprises:
determining, by the device node controller, the to-be-executed fault handling policy according to the fault information, and sending the execution command corresponding to the to-be-executed fault handling policy to the firmware.
5. The method of claim 4, wherein determining, by the device node controller, the to-be-executed fault handling policy according to the fault information comprises:
determining, by the device node controller, a fault mode and a probability that the fault produces an uncorrectable error according to the fault information, and determining the to-be-executed fault handling policy according to the fault mode and the probability.
6. The method according to claim 4 or 5, wherein the sending, to the firmware, the execution command corresponding to the to-be-executed fault handling policy comprises:
sending the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.
7. The method according to claim 6, wherein the sending the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device comprises:
calling the target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and sending the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.
8. The method of any of claims 1-7, wherein the firmware is a Basic Input Output System (BIOS).
9. An apparatus for fault handling, comprising a processor and a memory, wherein a plurality of programs corresponding to an OS and firmware, respectively, are stored in the memory, and the plurality of programs are read and executed by the processor to implement the method for fault handling according to any one of claims 1 to 8.
10. An apparatus for fault handling, the apparatus comprising:
an obtaining module, configured to obtain fault information through firmware and send the fault information to an Operating System (OS);
a sending module, configured to determine, by the OS according to the fault information, a to-be-executed fault handling policy, and send an execution command corresponding to the to-be-executed fault handling policy to the firmware;
an executing module, configured to execute the to-be-executed fault handling policy through the firmware.
11. The apparatus of claim 10, wherein the obtaining module is configured to:
call the ACPI Platform Error Interface (APEI) through the firmware to send the fault information to an ACPI virtual device driver in the OS.
12. The apparatus of claim 11, wherein the obtaining module is further configured to:
record the fault information in a target memory through the ACPI virtual device driver;
query the target memory, through the device node controller in the OS, at a preset period to acquire the fault information.
13. The apparatus of claim 12, wherein the sending module is configured to:
determine, by the device node controller, the to-be-executed fault handling policy according to the fault information, and send the execution command corresponding to the to-be-executed fault handling policy to the firmware.
14. The apparatus of claim 13, wherein the sending module is configured to:
determine, by the device node controller, a fault mode and a probability that the fault produces an uncorrectable error according to the fault information, and determine the to-be-executed fault handling policy according to the fault mode and the probability.
15. The apparatus of claim 13 or 14, wherein the sending module is configured to:
send the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.
16. The apparatus of claim 15, wherein the sending module is configured to:
call the target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and send the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.
17. The apparatus of any of claims 10-16, wherein the firmware is a Basic Input Output System (BIOS).
18. A computer-readable storage medium, in which a plurality of programs respectively corresponding to an OS and firmware are stored, the plurality of programs being read and executed by a processor to implement the method of fault handling according to any one of claims 1 to 8.
CN202110739085.5A 2021-06-30 2021-06-30 Method, apparatus and computer-readable storage medium for fault handling Pending CN115543666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110739085.5A CN115543666A (en) 2021-06-30 2021-06-30 Method, apparatus and computer-readable storage medium for fault handling

Publications (1)

Publication Number Publication Date
CN115543666A true CN115543666A (en) 2022-12-30

Family

ID=84716845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110739085.5A Pending CN115543666A (en) 2021-06-30 2021-06-30 Method, apparatus and computer-readable storage medium for fault handling

Country Status (1)

Country Link
CN (1) CN115543666A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination