CN115543666A

CN115543666A - Method, apparatus and computer-readable storage medium for fault handling

Info

Publication number: CN115543666A
Application number: CN202110739085.5A
Authority: CN
Inventors: 张俊; 仇连根; 龚彬阳
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2022-12-30

Abstract

The application discloses a fault handling method, a device, and a computer-readable storage medium, belonging to the technical field of RAS fault tolerance. The method comprises the following steps: fault information is acquired through firmware and reported to the OS; the OS determines a to-be-executed fault handling policy according to the fault information; the OS sends an execution command corresponding to the to-be-executed fault handling policy to the firmware; and the firmware executes the to-be-executed fault handling policy according to the execution command. The firmware in the application is not limited to encoding only one fault handling policy; it can encode a plurality of fault handling policies, and the OS decides which one to use and instructs the firmware to execute it. Therefore, the fault handling policy can be adjusted without upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.

Description

Method, apparatus and computer readable storage medium for fault handling

Technical Field

The present application relates to the field of RAS fault tolerance technologies, and in particular, to a method, an apparatus, and a computer-readable storage medium for fault handling.

Background

A server processes data frequently, and as its running time increases it inevitably experiences faults, such as memory faults and processor faults. It is therefore necessary to deploy an underlying fault tolerance technology (also called an underlying reliability, availability, and serviceability (RAS) technology) on the server to handle such faults.

Currently, the underlying RAS technology is mainly implemented by the firmware of a server. Specifically, the implementation may include the following process: after a chip detects a fault, it reports an interrupt to the firmware. The firmware then collects the fault information. If the fault is determined to be a memory fault according to the fault information, the hard-coded memory fault handling policy is executed; if the fault is determined to be a processor fault, the hard-coded processor fault handling policy is executed.

In the underlying RAS technology, a corresponding fault handling policy is hard-coded in the firmware for each kind of fault, such as memory faults and processor faults. In this case, replacing a fault handling policy requires upgrading the firmware in which the policy is hard-coded and then restarting the server. Restarting the server, however, may interrupt the services it provides.

Disclosure of Invention

The embodiments of the application provide a method, an apparatus, and a computer-readable storage medium for fault handling, which can solve the problem in the related art of service interruption caused by restarting a server when a fault handling policy is adjusted. The technical solutions are as follows:

In a first aspect, a method for fault handling is provided, where the method includes:

Fault information is acquired by firmware, and the acquired fault information is sent to an operating system (OS). The OS determines a to-be-executed fault handling policy according to the received fault information, and sends an execution command corresponding to the to-be-executed fault handling policy to the firmware. After receiving the execution command, the firmware executes the to-be-executed fault handling policy.

In the solution shown in the embodiment of the present application, the firmware is a basic input/output system (BIOS). The fault information includes a fault location, a fault level, a fault type, and the like. For memory faults, the fault types include FIFO overflow, timeout, and the like.

After detecting the fault, the hardware writes the fault information into a designated CPU register. The BIOS obtains the fault information from the designated CPU register and reports it to the OS. The OS determines a to-be-executed fault handling policy according to the fault information, and then sends an execution command corresponding to the to-be-executed fault handling policy to the BIOS. After receiving the execution command, the BIOS sends an execution notification corresponding to the to-be-executed fault handling policy to the hardware. After receiving the execution notification, the hardware calls the code corresponding to the to-be-executed fault handling policy in an operator module of the BIOS, thereby handling the fault.

It can be seen that the firmware in the present application is no longer limited to encoding only one fault handling policy; it may encode a plurality of fault handling policies, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy can be adjusted without upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.
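The division of labor described above can be illustrated with a minimal Python sketch. It is only a model of the idea, not the claimed implementation: the class names, the policy names `ppr` and `rank_replacement`, and the decision rule in `os_decide` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FaultInfo:
    location: str    # where the fault occurred (illustrative value, e.g. a DIMM slot)
    level: str       # fault level
    fault_type: str  # e.g. "timeout" or "fifo_overflow"


class Firmware:
    """Holds several pre-encoded policies and runs one only when commanded."""

    def __init__(self) -> None:
        self.executed: List[str] = []
        # Hypothetical policy names; a real firmware image defines its own set.
        self._policies: Dict[str, Callable[[FaultInfo], None]] = {
            "ppr": lambda f: self.executed.append(f"ppr@{f.location}"),
            "rank_replacement": lambda f: self.executed.append(f"rank@{f.location}"),
        }

    def execute(self, policy: str, fault: FaultInfo) -> None:
        self._policies[policy](fault)


def os_decide(fault: FaultInfo) -> str:
    """Toy OS-side decision rule; it can change without touching the firmware."""
    return "ppr" if fault.fault_type == "timeout" else "rank_replacement"


def handle_fault(firmware: Firmware, fault: FaultInfo) -> str:
    policy = os_decide(fault)        # the OS selects the policy
    firmware.execute(policy, fault)  # the firmware carries it out
    return policy
```

Because the selection logic lives in `os_decide`, changing the rule only requires updating OS-side software, which mirrors the benefit stated above.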

In a possible implementation manner, sending the fault information to the OS through the firmware may specifically be: the firmware sends the fault information to an Advanced Configuration and Power Interface (ACPI) virtual device driver (ACPI virtual device driver) in the OS through the ACPI Platform Error Interface (APEI).

In the solution shown in the embodiment of the present application, a fault reporting module in the BIOS calls the APEI to report the fault information to a fault notification chain registered in the kernel space (kernel space) of the OS. The fault notification chain then notifies the ACPI virtual device driver in the kernel space of the OS to acquire the fault information.

In one possible implementation, before the OS determines, according to the fault information, the failure mode and the probability that the fault will produce an uncorrectable error, the fault information may be obtained as follows:

The ACPI virtual device driver stores the fault information in a target memory, and a device node controller in the OS queries the target memory at a preset period to acquire the fault information.

In a possible implementation manner, the specific processing by which the OS determines a to-be-executed fault handling policy according to the fault information and sends the corresponding execution command to the firmware may be:

The device node controller in the OS determines the to-be-executed fault handling policy according to the fault information, and sends an execution command corresponding to the to-be-executed fault handling policy to the firmware.

In a possible implementation manner, the specific process by which the device node controller determines the to-be-executed fault handling policy according to the fault information may be:

The device node controller determines, according to the fault information, a failure mode and the probability that the fault will produce an uncorrectable error, and then determines the to-be-executed fault handling policy according to the failure mode and the probability.

In the solution shown in the embodiment of the present application, the device node controller may include a collector, a diagnoser, and a decision maker.

After the device node controller queries the target memory at the preset period and obtains the fault information, the specific processing is as follows: the diagnoser in the device node controller determines the failure mode and the probability that the fault will produce an uncorrectable error according to an intelligent diagnosis model and the fault information. The intelligent diagnosis model can be constructed using machine learning algorithms such as a threshold grading algorithm and a forest tree algorithm. Before it is used, the intelligent diagnosis model can be trained in advance on a large number of samples. Specifically, one group of samples may comprise fault information, the failure mode corresponding to the fault information, and the probability, corresponding to the fault information, that the fault is uncorrectable.

Taking the failure information as the memory failure information as an example, the failure mode may include a row failure, a column failure, and the like.

The specific processing by which the device node controller determines the to-be-executed fault handling policy according to the failure mode and the probability may be: the diagnoser sends the obtained failure mode and the probability that the fault will produce an uncorrectable error to the decision maker, and the decision maker determines the to-be-executed fault handling policy according to them.

The decision maker can look up the to-be-executed fault handling policy that corresponds to the currently obtained failure mode and probability in a pre-stored correspondence among failure modes, probabilities of an uncorrectable error, and fault handling policies.

Taking memory fault information as an example, the fault handling policies may include: setting storm suppression for memory fault interrupts, setting the memory polling period, performing mirror (Mirror) replacement, performing memory rank replacement, performing memory bank replacement, performing memory chip replacement, the ACLS (ARM Cache Line Sparing) method for repairing hard failures of memory cells, PPR, and the like.

In a possible implementation manner, the OS sends an execution command corresponding to the to-be-executed fault handling policy to the firmware, and the specific processing may be:

The OS sends an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.

In a possible implementation manner, the OS sends an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device, and the specific processing may be:

The decision maker calls a target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and sends an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.

In the solution shown in the embodiment of the present application, the target interface is one of the ACPI DSM interfaces encapsulated in the ACPI virtual device driver. Each fault handling policy for memory faults corresponds to an ACPI DSM interface.

In a second aspect, a fault handling apparatus is provided, which includes a processor and a memory, where a plurality of programs corresponding to an OS and firmware are stored, and the programs are read and executed by the processor to implement the fault handling method according to the first aspect.

In a third aspect, an apparatus for fault handling is provided, the apparatus comprising:

an obtaining module, configured to acquire fault information through firmware and send the fault information to an operating system (OS);

a sending module, configured to determine, through the OS and according to the fault information, a to-be-executed fault handling policy, and send an execution command corresponding to the to-be-executed fault handling policy to the firmware;

and an execution module, configured to execute the to-be-executed fault handling policy through the firmware.

In a possible implementation manner, the obtaining module is configured to:

call the ACPI Platform Error Interface (APEI) through the firmware to send the fault information to an ACPI virtual device driver in the OS.

In a possible implementation manner, the obtaining module is further configured to:

record the fault information in a target memory through the ACPI virtual device driver;

and query the target memory at a preset period through the device node controller in the OS to acquire the fault information.

In a possible implementation manner, the sending module is configured to:

determine, through the device node controller, a to-be-executed fault handling policy according to the fault information, and send an execution command corresponding to the to-be-executed fault handling policy to the firmware.

In a possible implementation manner, the sending module is configured to:

determine, through the device node controller and according to the fault information, a failure mode and the probability that the fault will produce an uncorrectable error, and determine the to-be-executed fault handling policy according to the failure mode and the probability.

In a possible implementation manner, the sending module is configured to:

send an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.

In a possible implementation manner, the sending module is configured to:

call a target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and send an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.

In one possible implementation, the firmware is a basic input output system BIOS.

In a fourth aspect, a computer-readable storage medium is provided, in which a plurality of programs respectively corresponding to an OS and firmware are stored, and the programs are configured to be read and executed by a processor to implement the method for fault handling according to the first aspect.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

In the embodiment of the application, after acquiring the fault information, the firmware reports it to the OS. The OS determines a to-be-executed fault handling policy according to the fault information, and then sends an execution command corresponding to the to-be-executed fault handling policy to the firmware. Finally, the firmware executes the to-be-executed fault handling policy according to the execution command. It can be seen that the firmware in the present application is no longer limited to encoding only one fault handling policy; it may encode a plurality of fault handling policies, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy can be adjusted without upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.

Drawings

Fig. 1 is a schematic architecture diagram of a server provided in an embodiment of the present application;

Fig. 2 is a comparison diagram of fault handling method architectures provided in an embodiment of the present application;

Fig. 3 is a diagram of a fault handling architecture provided by an embodiment of the present application;

Fig. 4 is a flowchart of a method for fault handling according to an embodiment of the present application;

Fig. 5 is a diagram of a fault handling architecture provided by an embodiment of the present application;

Fig. 6 is a flowchart of a method for fault handling according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a fault handling apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.

Detailed Description

The embodiment of the application provides a fault processing method which can be applied to a server, a storage system, a computer and the like. In the method, an Operating System (OS) implements a selection decision of a fault handling policy, and instructs a firmware to execute the fault handling policy determined by the selection decision, thereby completing the underlying fault-tolerant handling of the fault.

Referring to fig. 1, an architecture diagram of a server provided in an embodiment of the present application is shown.

The server shown in Fig. 1 includes a processor 110, a memory 120, a bridge chip 130, a storage controller 140, a hard disk 150, a flash memory 160, a network card 170, a graphics card 180, and a Baseboard Management Controller (BMC) 190. The processor 110 may detect server faults, such as memory faults and processor faults. The flash memory 160 may store a BIOS. The hard disk 150 may store an OS. The memory 120 may store fault information.

The processor 110 extends various interfaces through the bridge chip 130. For example, the flash memory 160 is connected through a Serial Peripheral Interface (SPI) of the bridge chip 130. The BMC 190 is connected through interfaces extended by the bridge chip 130, such as a Peripheral Component Interconnect Express (PCIe) interface and an asynchronous serial port. A PCIe interface extended by the bridge chip 130 also connects the network card 170. The BMC 190 is connected to the management network port, and the network card 170 is connected to the service network port. In addition, the processor 110 may provide a Universal Serial Bus (USB) interface through the bridge chip 130.

In order to more clearly understand the difference between the fault handling method provided by the embodiment of the present application and the underlying fault tolerance method of the fault in the related art, the following describes the fault handling method and the underlying fault tolerance method separately with reference to fig. 2.

The left diagram in Fig. 2 shows the underlying fault tolerance method in the related art. In the left diagram, the firmware collects the fault information and then executes the designated fault handling policy hard-coded in the firmware to realize the underlying fault tolerance. The firmware may also report the fault information to the OS, and the OS performs software-level soft handling of the fault based on the fault information.

The right diagram in Fig. 2 shows the fault handling method provided by the embodiment of the present application. In the right diagram, the firmware collects the fault information and then reports it to the OS. After acquiring the fault information, the OS decides on a corresponding fault handling policy according to the fault information and instructs the firmware to execute that policy. In addition, after acquiring the fault information, the OS can also perform software-level soft handling of the fault according to the fault information.

The following briefly describes a fault handling method provided in the embodiment of the present application with reference to a fault handling architecture diagram shown in fig. 3.

After detecting a fault, the hardware (hardware) reports fault information to the firmware (firmware). Specifically, the fault may be a memory fault, a processor fault, or the like. The hardware may be a processor of an X86 architecture or of an Advanced RISC Machine (ARM) architecture.

The firmware reports the fault information to the OS through the APEI interface, and an adaptive RAS management layer (adaptive RAS organization) in the user space (user space) of the OS acquires the fault information. "Adaptive RAS management layer" is only an exemplary name; the component may also be called a fault management module or the like. It is in essence a software module, and its specific name is not limited in this embodiment.

Then, the adaptive RAS management layer determines a fault handling policy according to the fault information, and sends an execution command corresponding to the determined fault handling policy to a RAS node driver (RAS node driver) in the kernel space (kernel space) of the OS. The RAS node driver issues the execution command to a firmware node (FW node) in the firmware through an adaptation driver (adaptation driver). The adaptation driver relays the execution command, converting it so that the relayed command can be received and parsed by the firmware node.

The firmware node receives the execution command and instructs the hardware to call corresponding code in a FW RAS Core (FW RAS Core) to execute the fault handling policy.
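The relay role of the adaptation driver can be pictured as encoding a command into a fixed layout that the firmware node knows how to parse. The sketch below uses Python's `struct` module for this; the field layout (`policy_id`, `flags`, `address`) and the function names are assumptions made purely for illustration, since the text does not define a command format.

```python
import struct

# Hypothetical wire format (an assumption, not taken from the text):
# little-endian u16 policy id, u16 flags, u64 target address.
_COMMAND_FORMAT = "<HHQ"


def adapt_command(policy_id: int, flags: int, address: int) -> bytes:
    """Adaptation-driver side: serialize the execution command."""
    return struct.pack(_COMMAND_FORMAT, policy_id, flags, address)


def firmware_node_parse(raw: bytes) -> dict:
    """Firmware-node side: parse the relayed execution command."""
    policy_id, flags, address = struct.unpack(_COMMAND_FORMAT, raw)
    return {"policy_id": policy_id, "flags": flags, "address": address}
```

A fixed, explicitly sized layout is one simple way to keep the OS-side and firmware-side views of a command in agreement.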

It should be noted that the RAS node driver, the adaptation driver, the firmware node, and the FW RAS core are all exemplary names. They are software modules in essence and may also have other names.

Fig. 4 shows a flowchart of a method for processing a fault according to an embodiment of the present application. Referring to fig. 4, the method may include the steps of:

Step 301, the firmware obtains the fault information.

The firmware is a basic input/output system (BIOS). The fault information includes a fault location, a fault level, a fault type, and the like. For memory faults, the fault types include FIFO overflow, timeout, and the like.

In an implementation, when a Central Processing Unit (CPU) detects a failure in a memory, CPU, or the like, the CPU generates failure information and writes the failure information in a designated CPU register. Then, the CPU sends an interrupt signal to the firmware. And after receiving an interrupt signal sent by the CPU, the firmware acquires fault information from a specified CPU register.

In particular, see the fault handling architecture diagram shown in fig. 5. The processing of the firmware in acquiring the failure information may be as follows:

An error report (error report) module in the firmware obtains the fault information from the designated CPU register.

It should be noted that the CPU belongs to a part of the hardware in fig. 5, and specifically, the CPU may be a processor in an X86 architecture or a processor in an ARM architecture.

Step 302, the firmware sends failure information to the OS.

In an implementation, the firmware may report the failure information to the OS after acquiring the failure information.

In particular, see the fault handling architecture diagram shown in fig. 5. The process of reporting the fault information to the OS by the firmware may be as follows:

The error report module calls the APEI to report the fault information to a fault notification chain (i.e., the APEI driver in Fig. 5) registered in the kernel space (kernel space) of the OS.

Then, the fault notification chain notifies an ACPI virtual device driver (ACPI virtual device driver) in the kernel space of the OS to acquire the fault information.

The ACPI virtual device driver acquires the fault information and records it in the specified memory. A device node controller (device node controller) in the user space (user space) of the OS polls the specified memory to acquire the fault information.

The device node controller may include a collector, a diagnoser, and a decision maker. The process by which the device node controller polls the specified memory to obtain the fault information may specifically be: the collector polls the specified memory to obtain the fault information.
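The hand-off between the driver and the collector is a producer and a periodic poller sharing a buffer. The following sketch models that pattern in user-space Python; `TargetMemory`, `drain`, and `collect` are hypothetical names, and a plain in-process queue stands in for the specified memory.

```python
import threading
import time
from collections import deque
from typing import Deque, List


class TargetMemory:
    """Stand-in for the memory region the driver writes fault records into."""

    def __init__(self) -> None:
        self._records: Deque[dict] = deque()
        self._lock = threading.Lock()

    def write(self, record: dict) -> None:
        """Driver side: append one fault record."""
        with self._lock:
            self._records.append(record)

    def drain(self) -> List[dict]:
        """Collector side: take every record written since the last poll."""
        with self._lock:
            items = list(self._records)
            self._records.clear()
            return items


def collect(memory: TargetMemory, period_s: float, cycles: int) -> List[dict]:
    """Poll the memory once per period for a fixed number of cycles."""
    collected: List[dict] = []
    for _ in range(cycles):
        collected.extend(memory.drain())
        time.sleep(period_s)
    return collected
```

With this design a record written at any time is picked up within one polling period, which matches the behavior described for the collector.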

Note that the ACPI virtual device driver in fig. 5 corresponds to a combination of the RAS node driver and the adaptation driver in fig. 3. The device node controller in FIG. 5 is equivalent to adaptive RAS organization in FIG. 3.

The ACPI virtual device driver and the fault notification chain are described below:

Before step 301 is executed, an ACPI virtual device (ACPI virtual device) may be integrated into the firmware and reported to the OS. The kernel space of the OS may then create an ACPI virtual device driver for the ACPI virtual device, and the ACPI virtual device driver registers a fault notification chain (notification chain).

The ACPI virtual device driver provides at least the following functions:

registering the fault notification chain, and encapsulating ACPI Device Specific Method (ACPI DSM) interfaces and providing them to the device node controller.
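A notification chain is essentially a list of registered callbacks that are invoked in order when an event arrives. The sketch below models the registration and notification steps in Python; it does not use any real kernel API, and the class and method names are illustrative assumptions.

```python
from typing import Callable, List

FaultCallback = Callable[[dict], None]


class FaultNotificationChain:
    """Ordered list of callbacks invoked whenever fault information arrives."""

    def __init__(self) -> None:
        self._callbacks: List[FaultCallback] = []

    def register(self, callback: FaultCallback) -> None:
        self._callbacks.append(callback)

    def notify(self, fault_info: dict) -> None:
        for callback in self._callbacks:
            callback(fault_info)


class VirtualDeviceDriver:
    """Hypothetical driver that registers itself and records what it is told."""

    def __init__(self, chain: FaultNotificationChain) -> None:
        self.target_memory: List[dict] = []
        chain.register(self._on_fault)

    def _on_fault(self, fault_info: dict) -> None:
        # Record the fault information so that a poller can pick it up later.
        self.target_memory.append(fault_info)
```

Here the firmware-facing side only needs to call `notify`; it does not need to know which drivers are listening.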

Step 303, the OS determines a to-be-executed fault handling policy according to the fault information.

In implementation, the collector in the device node controller sends the fault information to the diagnoser, and the diagnoser determines the corresponding failure mode and the probability that the fault will produce an uncorrectable error according to the intelligent diagnosis model.

Taking CPU fault information as an example, the failure mode may include an in-core cache (Cache) failure, a logic execution unit fault, and the like.

It should be noted that the intelligent diagnosis model may be constructed using machine learning algorithms such as a threshold grading algorithm and a forest tree algorithm. Before use, the intelligent diagnosis model may be trained in advance on a large number of samples. Specifically, one group of samples may include fault information, the failure mode corresponding to the fault information, and the probability, corresponding to the fault information, that the fault is uncorrectable.
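To make the input and output of the diagnoser concrete, the sketch below uses a hand-written threshold grading rule in place of a trained model. Everything in it is an assumption for illustration: the feature names (`distinct_rows`, `distinct_columns`, `corrected_errors`), the grade boundaries, and the probabilities are invented, and a real model would be learned from the labeled samples described above.

```python
from typing import Dict, List, Tuple

# Hypothetical grading table: (minimum corrected-error count, probability of an
# uncorrectable error). The values are invented for illustration only.
_GRADES: List[Tuple[int, float]] = [(100, 0.9), (20, 0.5), (0, 0.1)]


def diagnose(fault_info: Dict[str, int]) -> Tuple[str, float]:
    """Return (failure mode, probability of an uncorrectable error)."""
    rows = fault_info["distinct_rows"]
    cols = fault_info["distinct_columns"]
    # Illustrative heuristic: errors confined to a single row suggest a row
    # failure, and errors confined to a single column suggest a column failure.
    if rows == 1 and cols > 1:
        mode = "row"
    elif cols == 1 and rows > 1:
        mode = "column"
    else:
        mode = "other"
    count = fault_info["corrected_errors"]
    # Pick the first grade whose threshold the error count reaches.
    probability = next(p for threshold, p in _GRADES if count >= threshold)
    return mode, probability
```

The returned pair is exactly what the decision maker consumes in the next step.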

The diagnoser sends the obtained failure mode and the probability that the fault will produce an uncorrectable error to the decision maker, and the decision maker determines the to-be-executed fault handling policy according to them.

For example, the decision maker may look up the to-be-executed fault handling policy that corresponds to the currently obtained failure mode and probability in a pre-stored correspondence among failure modes, probabilities of an uncorrectable error, and fault handling policies. Taking a memory fault as an example, when the failure mode is a row failure and the probability that the fault will produce an uncorrectable error is greater than a preset threshold, the to-be-executed fault handling policy is determined to be the PPR (Post Package Repair) method, which is a method for repairing memory row errors.
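The pre-stored correspondence can be modeled as a lookup table keyed by failure mode and by whether the probability exceeds the preset threshold. In the sketch below only the row-failure rule that selects PPR comes from the text; the threshold value, the other table entries, and the fallback `no_action` are hypothetical.

```python
from typing import Dict, Tuple

UE_PROBABILITY_THRESHOLD = 0.8  # hypothetical preset threshold

# (failure mode, probability above threshold?) -> policy name.
# Only the ("row", True) -> "ppr" entry reflects the example in the text;
# the remaining entries are placeholders.
_POLICY_TABLE: Dict[Tuple[str, bool], str] = {
    ("row", True): "ppr",
    ("row", False): "adjust_polling_period",
    ("column", True): "bank_replacement",
    ("column", False): "adjust_polling_period",
}


def decide(mode: str, probability: float) -> str:
    """Look up the to-be-executed policy for a diagnosed fault."""
    high = probability > UE_PROBABILITY_THRESHOLD
    return _POLICY_TABLE.get((mode, high), "no_action")
```

Keeping the correspondence in data rather than in code means it can be updated on the OS side without any change to the firmware.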

Step 304, the OS sends an execution command corresponding to the to-be-executed fault handling policy to the firmware.

In implementation, the decision maker calls the target interface that corresponds to the to-be-executed fault handling policy and is encapsulated in the ACPI virtual device driver, and sends an execution command to the ACPI virtual device through the target interface. The target interface is one of the ACPI DSM interfaces encapsulated in the ACPI virtual device driver.
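The call path from the decision maker to the ACPI virtual device can be sketched as a driver object that maps each policy to an interface and forwards the command. This is a plain Python model, not a real ACPI binding: `evaluate_dsm`, `call_target_interface`, and the function indices are hypothetical, and they do not reproduce the actual policy-to-interface correspondence.

```python
from typing import Dict, List, Tuple


class AcpiVirtualDevice:
    """Firmware-side stand-in that records the commands it receives."""

    def __init__(self) -> None:
        self.received: List[Tuple[int, dict]] = []

    def evaluate_dsm(self, function_index: int, args: dict) -> int:
        self.received.append((function_index, args))
        return 0  # 0 means success in this sketch


class DsmDriver:
    """Hypothetical driver that wraps one interface per fault handling policy."""

    # Invented function indices; the real mapping is platform-defined.
    _FUNCTION_INDEX: Dict[str, int] = {"ppr": 1, "rank_replacement": 2}

    def __init__(self, device: AcpiVirtualDevice) -> None:
        self._device = device

    def call_target_interface(self, policy: str, **args: int) -> int:
        try:
            index = self._FUNCTION_INDEX[policy]
        except KeyError:
            raise ValueError(f"no interface wraps policy {policy!r}") from None
        return self._device.evaluate_dsm(index, args)
```

For example, `DsmDriver(device).call_target_interface("ppr", row=7)` forwards function index 1 together with the row argument to the device model.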

As shown in Table 1 below, each fault handling policy for memory faults corresponds to an ACPI DSM interface.

TABLE 1

Step 305, the firmware executes the to-be-executed fault handling policy.

In implementation, after receiving an execution command corresponding to a to-be-executed fault handling policy, the ACPI virtual device sends an execution notification corresponding to the to-be-executed fault handling policy to the CPU. After receiving the execution notification corresponding to the to-be-executed fault processing strategy, the CPU calls a code corresponding to the to-be-executed fault processing strategy in an operator module of the firmware to realize fault processing.

Specifically, if the CPU is a processor of an X86 architecture, the ACPI virtual device may notify the execution of the to-be-executed fault handling policy through a System Management Interrupt (SMI).

If the CPU is an ARM architecture processor, the ACPI virtual device may notify the execution of the fault handling policy to be executed through a Serial Peripheral Interface (SPI) or a System Control and Management Interface (SCMI).
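The architecture-dependent choice of notification mechanism reduces to a small selection function. In the sketch below the mechanism names come from the text, while the `Arch` enum and the `prefer_scmi` flag are illustrative assumptions, since the text does not say how SPI and SCMI are chosen between on ARM.

```python
from enum import Enum


class Arch(Enum):
    X86 = "x86"
    ARM = "arm"


def notification_mechanism(arch: Arch, prefer_scmi: bool = False) -> str:
    """Pick how the ACPI virtual device notifies the CPU to run a policy."""
    if arch is Arch.X86:
        return "SMI"
    if arch is Arch.ARM:
        # The text allows either mechanism on ARM; the flag is hypothetical.
        return "SCMI" if prefer_scmi else "SPI"
    raise ValueError(f"unsupported architecture: {arch}")
```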

The following describes a processing flow of the method for fault handling in the fault handling architecture shown in fig. 5, with reference to a flow chart of the method for fault handling shown in fig. 6.

Step 501, the hardware detects the fault, acquires the fault information, and writes the fault information into a designated CPU register.

Step 502, the hardware sends an interrupt signal to the firmware.

In an implementation, the hardware may send an interrupt signal to the firmware after writing the fault information to the designated CPU register.

Step 503, the firmware reads the fault information in the designated CPU register.

In implementation, after receiving an interrupt signal sent by hardware, the firmware reads fault information in a designated CPU register.

Step 504, the firmware calls the APEI to send the fault information to the APEI driver.

In implementation, after reading the fault information in the designated CPU register, the firmware calls the APEI to send the fault information to the APEI driver.

Step 505, the APEI driver notifies the ACPI virtual device driver to acquire the fault information.

In implementation, after receiving the failure information, the APEI driver notifies the ACPI virtual device driver to acquire the failure information.

Step 506, the ACPI virtual device driver records the fault information in the designated memory.

In implementation, after acquiring the fault information, the ACPI virtual device driver writes the fault information into the specified memory.

Step 507, the collector in the device node controller polls the specified memory to acquire the fault information.

In implementation, the collector in the device node controller queries the specified memory at a preset period. Once the fault information has been written into the specified memory, the collector acquires it in a subsequent polling period.

Step 508, the collector sends the fault information to the diagnoser in the device node controller.

In implementation, after the collector acquires the fault information, it sends the fault information to the diagnoser in the device node controller.

Step 509, the diagnoser determines the corresponding failure mode and the probability that the fault will produce an uncorrectable error according to the fault information.

In implementation, after the diagnoser obtains the fault information, it inputs the fault information into the intelligent diagnosis model, and the intelligent diagnosis model outputs the corresponding failure mode and the probability that the fault will produce an uncorrectable error.

Step 510, the diagnoser sends the failure mode and the probability that the fault will produce an uncorrectable error to the decision maker in the device node controller.

In implementation, after obtaining the failure mode and the probability, the diagnoser sends them to the decision maker in the device node controller.

Step 511, the decision maker determines the to-be-executed fault handling policy according to the fault mode and the probability that the fault results in an uncorrectable error.

In implementation, the decision maker looks up the pre-stored correspondence among fault modes, probabilities of an uncorrectable error, and fault handling policies, and determines the to-be-executed fault handling policy that corresponds to the currently obtained fault mode and probability.
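As a non-limiting illustration, the pre-stored correspondence consulted in step 511 might be sketched as a lookup table. The mode labels, the probability thresholds, and the policy names (`log_only`, `page_offline`, `memory_isolation`) are hypothetical; the application does not enumerate them here.

```python
from typing import Dict, List, Tuple

# Assumed correspondence: for each fault mode, thresholds in descending order.
# The first entry whose minimum probability is met selects the policy.
POLICY_TABLE: Dict[str, List[Tuple[float, str]]] = {
    "single_row": [(0.5, "page_offline"), (0.0, "log_only")],
    "multi_row": [(0.5, "memory_isolation"), (0.0, "page_offline")],
}


def decide(fault_mode: str, ue_probability: float) -> str:
    """Return the to-be-executed policy for a fault mode and UE probability."""
    if not 0.0 <= ue_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for min_probability, policy in POLICY_TABLE[fault_mode]:
        if ue_probability >= min_probability:
            return policy
    raise LookupError(f"no policy for mode {fault_mode!r}")
```

Because the table lives on the OS side in this sketch, changing a threshold or a policy assignment would only require updating `POLICY_TABLE`, which is consistent with the aim of adjusting policies without upgrading the firmware.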

Step 512, the decision maker calls the target interface, encapsulated in the ACPI virtual device driver, that corresponds to the to-be-executed fault handling policy.

In implementation, after determining the to-be-executed fault handling policy, the decision maker calls the target interface, encapsulated in the ACPI virtual device driver, that corresponds to the to-be-executed fault handling policy.

Step 513, the ACPI virtual device driver sends an execution command to the ACPI virtual device through the target interface.
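As a non-limiting illustration, steps 512 and 513 might be sketched as a driver that encapsulates one interface per fault handling policy, each of which forwards an execution command to the ACPI virtual device. The class names, the policy names, and the numeric command codes are hypothetical and are not taken from the application or from the ACPI specification.

```python
from typing import Callable, Dict, List


class AcpiVirtualDevice:
    """Hypothetical stand-in for the ACPI virtual device on the firmware side."""

    def __init__(self) -> None:
        self.received: List[int] = []

    def execute(self, command_code: int) -> None:
        # Record the execution command; real firmware would act on it.
        self.received.append(command_code)


class AcpiVirtualDeviceDriver:
    """Encapsulates one callable interface per fault handling policy."""

    def __init__(self, device: AcpiVirtualDevice,
                 command_codes: Dict[str, int]) -> None:
        self._device = device
        self._interfaces: Dict[str, Callable[[], None]] = {
            policy: self._make_interface(code)
            for policy, code in command_codes.items()
        }

    def _make_interface(self, command_code: int) -> Callable[[], None]:
        def interface() -> None:
            # Step 513: send the execution command through the interface.
            self._device.execute(command_code)
        return interface

    def target_interface(self, policy: str) -> Callable[[], None]:
        # Step 512: the decision maker obtains and calls this interface.
        return self._interfaces[policy]
```

In this sketch the decision maker would call `driver.target_interface("page_offline")()`, and the device would then hold the command code assigned to that policy.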

Step 514, the ACPI virtual device sends an execution notification corresponding to the to-be-executed fault handling policy to the hardware.

In implementation, if the hardware is a processor of the X86 architecture, the ACPI virtual device may send the execution notification for the to-be-executed fault handling policy through an SMI.

If the hardware is a processor of the ARM architecture, the ACPI virtual device may send the execution notification for the to-be-executed fault handling policy through an SPI or SCMI.
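As a non-limiting illustration, the architecture-dependent choice of notification mechanism in step 514 might be sketched as follows. The function `notify_hardware`, the returned string format, and the rule of picking the first listed mechanism are assumptions for this sketch; only the pairing of X86 with SMI and of ARM with SPI or SCMI comes from the text above.

```python
from typing import Dict, Tuple

# Mechanisms named in the description, keyed by processor architecture.
NOTIFICATION_MECHANISMS: Dict[str, Tuple[str, ...]] = {
    "x86": ("SMI",),
    "arm": ("SPI", "SCMI"),
}


def notify_hardware(architecture: str, policy: str) -> str:
    """Return a description of the notification sent for the given policy."""
    mechanisms = NOTIFICATION_MECHANISMS.get(architecture.lower())
    if mechanisms is None:
        raise ValueError(f"unsupported architecture: {architecture}")
    # Assumption: use the first listed mechanism; a real platform would
    # select according to what its firmware actually supports.
    return f"{mechanisms[0]}:{policy}"
```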

Step 515, the hardware invokes the code corresponding to the to-be-executed fault handling policy in the operator module of the firmware, so as to implement fault handling.

It should be noted that the specific processing of each module in steps 501 to 515 is the same as the specific processing of the corresponding module in steps 301 to 305, and is not described herein again.

In the embodiment of the present application, after acquiring the fault information, the firmware reports the fault information to the OS. The OS determines the to-be-executed fault handling policy according to the fault information, and then sends an execution command corresponding to the to-be-executed fault handling policy to the firmware. Finally, the firmware executes the to-be-executed fault handling policy according to the execution command. It can be seen that, in the present application, the firmware is no longer limited to encoding only one fault handling policy but may encode multiple fault handling policies, and the OS decides which fault handling policy to use and instructs the firmware to execute it. Therefore, the fault handling policy does not need to be adjusted by upgrading the firmware, which avoids the service interruption caused by restarting the server when the fault handling policy is adjusted.
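As a non-limiting illustration, the overall division of labor might be sketched as follows: the firmware encodes several fault handling policies, and the OS selects one and issues the execution command. The classes `Firmware` and `OperatingSystem`, the policy names, and the `choose` callback are hypothetical names introduced only for this sketch.

```python
from typing import Callable, Dict, List, Tuple


class Firmware:
    """Encodes several fault handling policies and executes one on command."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, int]] = []
        self._handlers: Dict[str, Callable[[dict], None]] = {
            "log_only": self._log_only,
            "page_offline": self._page_offline,
        }

    def _log_only(self, fault_info: dict) -> None:
        self.log.append(("log_only", fault_info["address"]))

    def _page_offline(self, fault_info: dict) -> None:
        self.log.append(("page_offline", fault_info["address"]))

    def report(self, fault_info: dict, os_: "OperatingSystem") -> str:
        # The firmware reports the fault information to the OS.
        return os_.on_fault(fault_info, self)

    def execute(self, command: str, fault_info: dict) -> None:
        # The firmware executes the policy named by the execution command.
        self._handlers[command](fault_info)


class OperatingSystem:
    """Decides the policy; the decision rule can change at run time."""

    def __init__(self, choose: Callable[[dict], str]) -> None:
        self.choose = choose

    def on_fault(self, fault_info: dict, firmware: Firmware) -> str:
        policy = self.choose(fault_info)
        firmware.execute(policy, fault_info)
        return policy
```

In this sketch, replacing `choose` on a running `OperatingSystem` instance changes which policy the firmware executes, while the `Firmware` object itself is left untouched.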

Based on the same technical concept, an embodiment of the present application further provides a fault handling apparatus, as shown in fig. 7, the apparatus includes:

an obtaining module 710, configured to obtain fault information through firmware and send the fault information to an operating system (OS), and specifically able to implement the obtaining and sending functions in step 301 and step 302 above and other implicit steps;

a sending module 720, configured to determine, through the OS, a to-be-executed fault handling policy according to the fault information, and send an execution command corresponding to the to-be-executed fault handling policy to the firmware, and specifically able to implement the determining and sending functions in step 303 and step 304 above and other implicit steps;

an executing module 730, configured to execute the to-be-executed fault handling policy through the firmware, and specifically able to implement the executing function in step 305 above and other implicit steps.

In a possible implementation manner, the obtaining module is configured to:

In a possible implementation manner, the obtaining module 710 is further configured to:

In a possible implementation manner, the sending module 720 is configured to:

send an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device.

In a possible implementation manner, the sending module 720 is configured to:

It should be noted that the fault handling apparatus provided in the foregoing embodiment is described using the above division of functional modules only as an example. In practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the fault handling apparatus and the fault handling method provided in the foregoing embodiments belong to the same concept; for the specific implementation process of the apparatus, refer to the method embodiments, which is not described herein again.

Referring to fig. 8, an embodiment of the present application provides a schematic diagram of a device 600. The device 600 may be a computer, a server, or the like. The device 600 includes at least a processor 601, an internal connection 602, and a memory 603.

In a possible implementation manner, the processor 601 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solutions of the present application.

The internal connection 602 may include a path for passing information between the above components. Optionally, the internal connection 602 is a single board or a bus.

The memory 603 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory may exist independently and be coupled to the processor via a bus. The memory may also be integrated with the processor.

The memory 603 is used for storing the program code for executing the solutions of the present application, and the processor 601 controls the execution. The processor 601 is configured to execute the program code stored in the memory 603, thereby causing the device 600 to implement the functions of the present application.

In particular implementations, processor 601 may include one or more CPUs, such as CPU0 and CPU1 in fig. 8, as one embodiment.

In particular implementations, the device 600 may include multiple processors, as one embodiment. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented by software, all or part of the implementation may be realized in the form of a computer program product. The computer program product comprises one or more computer program instructions which, when loaded and executed on a device, produce, in whole or in part, the processes or functions according to the embodiments of the present application. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium that can be accessed by the device, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk).

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

The above description is only an example of the present invention and should not be taken as limiting the present invention, and any modifications, equivalents, improvements and the like made within the principles of the present invention should be included in the scope of the present invention.

Claims

1. A method of fault handling, the method comprising:

acquiring fault information through firmware and sending the fault information to an Operating System (OS);

determining, by the OS, a to-be-executed fault handling policy according to the fault information, and sending an execution command corresponding to the to-be-executed fault handling policy to the firmware;

and executing the to-be-executed fault handling policy through the firmware.

2. The method of claim 1, wherein sending the fault information to an ACPI virtual device driver in the OS comprises:

3. The method of claim 2, wherein prior to determining, by the OS, the failure mode and the probability of the failure occurring as an uncorrectable error based on the failure information, the method further comprises:

4. The method according to claim 3, wherein the determining, by the OS, a to-be-executed fault handling policy according to the fault information and sending an execution command corresponding to the to-be-executed fault handling policy to the firmware comprises:

5. The method of claim 4, wherein determining, by the device node controller, a fault handling policy to be implemented based on the fault information comprises:

6. The method according to claim 4 or 5, wherein the sending, to the firmware, the execution command corresponding to the to-be-executed fault handling policy includes:

7. The method according to claim 6, wherein the sending the execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device comprises:

8. The method of any of claims 1-7, wherein the firmware is a Basic Input Output System (BIOS).

9. An apparatus for fault handling, comprising a processor and a memory, wherein a plurality of programs corresponding to an OS and firmware, respectively, are stored in the memory, and the plurality of programs are read and executed by the processor to implement the method for fault handling according to any one of claims 1 to 8.

10. An apparatus for fault handling, the apparatus comprising:

and an execution module, configured to execute the to-be-executed fault handling policy through the firmware.

11. The apparatus of claim 10, wherein the obtaining module is configured to:

12. The apparatus of claim 11, wherein the obtaining module is further configured to:

13. The apparatus of claim 12, wherein the sending module is configured to:

14. The apparatus of claim 13, wherein the sending module is configured to:

15. The apparatus of claim 13 or 14, wherein the sending module is configured to:

16. The apparatus of claim 15, wherein the sending module is configured to:

call the target interface, encapsulated in the ACPI virtual device driver, that corresponds to the to-be-executed fault handling policy, and send an execution command corresponding to the to-be-executed fault handling policy to the ACPI virtual device through the target interface.

17. The apparatus of any of claims 10-16, wherein the firmware is a Basic Input Output System (BIOS).

18. A computer-readable storage medium, in which a plurality of programs respectively corresponding to an OS and firmware are stored, the plurality of programs being read and executed by a processor to implement the method of fault handling according to any one of claims 1 to 8.