WO2024066589A1 - Method for processing hardware fault reporting and related equipment - Google Patents

Method for processing hardware fault reporting and related equipment

Info

Publication number
WO2024066589A1
WO2024066589A1 PCT/CN2023/104312 CN2023104312W WO2024066589A1 WO 2024066589 A1 WO2024066589 A1 WO 2024066589A1 CN 2023104312 W CN2023104312 W CN 2023104312W WO 2024066589 A1 WO2024066589 A1 WO 2024066589A1
Authority
WO
WIPO (PCT)
Prior art keywords
threshold
processing unit
computing device
independent processing
reporting
Prior art date
Application number
PCT/CN2023/104312
Other languages
English (en)
French (fr)
Inventor
李胜
张光彪
Original Assignee
超聚变数字技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 超聚变数字技术有限公司
Publication of WO2024066589A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/30 Monitoring

Definitions

  • the embodiments of the present application relate to the field of hardware, and in particular to a method for processing hardware fault reporting and related equipment.
  • CE: correctable error
  • OS: operating system
  • the embodiment of the present application provides a method for processing hardware fault reporting and related equipment, which are applied in the field of hardware.
  • the method for processing hardware fault reporting can minimize the impact of fault interruption reporting on normal business when there are too many CE faults, and can be applied to fault diagnosis systems with different capabilities and various application scenarios.
  • a method for processing a hardware fault report comprising:
  • the computing device obtains at least a first threshold and a second threshold through an algorithm of the independent processing unit, and the first threshold and the second threshold are stored in the independent processing unit;
  • the computing device determines, based on the first threshold, that consecutive correctable errors (CEs) have occurred;
  • the computing device stops reporting CE interrupt based on the number of consecutive CEs and a second threshold, where the CE interrupt is used to notify the occurrence of CE.
  • the computing device obtains the first threshold and the second threshold through the algorithm of an independent processing unit, can support real-time modification of each threshold data, and stops continuous reporting of CE interruptions based on each threshold data, which can minimize the impact of reporting of fault interruptions on normal business when there are too many CE faults, ensure the operation of normal business, and at the same time be applicable to more application environments, increasing the scope of application of this solution.
  • the computing device obtains the third threshold through an algorithm of an independent processing unit, and the third threshold is stored in the independent processing unit.
  • after the computing device stops reporting the CE interrupt based on the number of consecutive CEs and the second threshold, the computing device resumes reporting the CE interrupt based on the duration for which CE interrupt reporting has been stopped and the third threshold.
  • the independent processing unit also obtains a third threshold, and can continue to report a CE interrupt based on the third threshold, can continue to provide fault data, and can continue to manage the fault status of the hardware structure of the computing device.
  • through the algorithm of the independent processing unit, the computing device determines the first threshold and the second threshold in real time based on the occupancy rate of the central processing unit (CPU), and determines the third threshold based on the capability requirement of the fault diagnosis system.
  • an independent processing unit obtains a first threshold and a second threshold based on CPU occupancy, and determines a third threshold based on the capability requirements of a fault diagnosis system. This can adapt to application scenarios in real time, can adapt to the requirements of fault diagnosis systems with different capabilities, and improve the flexibility of the solution.
  • the computing device obtains the first threshold, the second threshold, and the third threshold customized by the user from the interface through an algorithm of the independent processing unit, and the algorithm supports obtaining data from the interface of the independent processing unit.
  • the computing device obtains the first threshold, the second threshold and the third threshold customized by the user from the interface through the algorithm of an independent processing unit.
  • the user modifies the first threshold, the second threshold and the third threshold in real time from the interface according to the current application scenario, and can obtain the corresponding threshold data according to the application scenario strategy, and can adapt to more application scenarios.
  • the computing device obtains the fourth threshold through an algorithm of an independent processing unit, and the fourth threshold is stored in the independent processing unit.
  • after the computing device resumes reporting the CE interrupt based on the duration for which CE interrupt reporting has been stopped and the third threshold, the computing device accumulates a target number, where the target number is the number of times CE interrupt reporting has been resumed after being stopped; the computing device then permanently prohibits reporting of the CE interrupt based on the target number and the fourth threshold.
  • the computing device determines the occurrence of consecutive CEs based on a first threshold through BIOS, and then stops CE interrupt reporting based on the number of consecutive CEs and a second threshold through BIOS.
  • the computing device determines the occurrence of continuous CE based on a first threshold through a baseboard management controller (BMC) or OS.
  • BMC: baseboard management controller
  • the computing device stops CE interrupt reporting based on the number of consecutive CEs and the second threshold through the BMC or OS.
  • the reporting of CE interruption can also be stopped through BMC or OS, which reflects the flexibility of the solution.
  • the independent processing unit is any one of the following:
  • intelligent management unit (IMU);
  • management engine (ME);
  • OS; the specific details are not limited here.
  • a computing device including a CPU and an independent processing unit, wherein the CPU is used to store BIOS;
  • the independent processing unit is used to obtain at least a first threshold value and a second threshold value through an algorithm, and the first threshold value and the second threshold value are stored in the independent processing unit;
  • the BIOS is used to determine that consecutive correctable errors CE occur based on a first threshold
  • BIOS is also used to accumulate the number of consecutive CEs
  • the BIOS is further configured to stop reporting of a CE interrupt based on the number of consecutive CEs and a second threshold value, wherein the CE interrupt is used to notify the occurrence of a CE.
  • the independent processing unit obtains the first threshold and the second threshold through an algorithm to support real-time modification of each threshold data.
  • the BIOS stops the continuous reporting of the CE interrupt based on the first threshold and the second threshold to minimize the impact of the reporting of fault interrupts on normal business when there are too many CE faults. At the same time, it can be applicable to more application environments, increasing the scope of application of this solution.
  • the independent processing unit is further configured to obtain a third threshold value through an algorithm, and the third threshold value is stored in the independent processing unit;
  • the independent processing unit is further configured to continue reporting the CE interruption based on the duration of stopping the CE interruption reporting and the third threshold.
  • the independent processing unit also obtains a third threshold, and can continue to report a CE interrupt based on the third threshold, can continue to provide fault data, and can continue to manage the fault status of the hardware structure of the computing device.
  • the independent processing unit is specifically used to determine, through an algorithm, the first threshold and the second threshold in real time based on the CPU occupancy, and to determine the third threshold based on the capability requirement of the fault diagnosis system.
  • an independent processing unit obtains a first threshold and a second threshold based on CPU occupancy, and determines a third threshold based on the capability requirements of a fault diagnosis system. This can adapt to application scenarios in real time, can adapt to the requirements of fault diagnosis systems with different capabilities, and improve the flexibility of the solution.
  • the independent processing unit is also used to obtain a fourth threshold through an algorithm, and the fourth threshold is stored in the independent processing unit. The independent processing unit is also used to accumulate a target number, where the target number is the number of times CE interrupt reporting has been resumed after being stopped, and to permanently prohibit reporting of the CE interrupt based on the target number and the fourth threshold.
  • another computing device including a CPU, an independent processing unit, and a storage chip, wherein the storage chip is used to store BIOS, and the CPU is used to run the BIOS;
  • the independent processing unit is used to obtain at least a first threshold value and a second threshold value through an algorithm, and the first threshold value and the second threshold value are stored in the independent processing unit;
  • the BIOS is used to determine that consecutive correctable errors CE occur based on a first threshold
  • BIOS is also used to accumulate the number of consecutive CEs
  • the BIOS is further configured to stop reporting of a CE interrupt based on the number of consecutive CEs and a second threshold value, wherein the CE interrupt is used to notify the occurrence of a CE.
  • the independent processing unit obtains the first threshold and the second threshold through an algorithm to support real-time modification of each threshold data.
  • the BIOS stops the continuous reporting of CE interrupts based on the first threshold and the second threshold to minimize the impact of fault interrupt reporting on normal business when there are too many CE faults. At the same time, it can be applied to more application environments, increasing the scope of application of this solution. And the BIOS is stored in a storage chip, which increases the diversity of the solution.
  • another computing device including a CPU, an independent processing unit, and a BMC chip;
  • the independent processing unit is used to obtain at least a first threshold value and a second threshold value through an algorithm, and the first threshold value and the second threshold value are stored in the independent processing unit;
  • the BMC chip is used to determine the occurrence of continuous correctable errors CE based on a first threshold
  • the BMC chip is also used to accumulate the number of consecutive CEs
  • the BMC is further used to stop CE interrupt reporting based on the number of consecutive CEs and a second threshold value.
  • the CE interrupt is used to notify the occurrence of CE.
  • the independent processing unit obtains the first threshold and the second threshold through an algorithm to support real-time modification of each threshold data.
  • the BMC stops the continuous reporting of CE interruptions based on the first threshold and the second threshold to minimize the impact of fault interruption reporting on normal business when there are too many CE faults. At the same time, it can be applicable to more application environments, increase the scope of application of the present solution, and reflect the flexibility of the solution.
  • another computing device which may include a processor coupled to a memory, wherein the memory is used to store instructions, and the processor is used to execute the instructions in the memory so that the computing device performs the method described in the first aspect of the embodiment of the present application or any possible implementation of the first aspect.
  • another computing device comprising a processor for executing a computer program (or computer executable instructions) stored in a memory, wherein when the computer program (or computer executable instructions) is executed, the method in the first aspect and each possible implementation of the first aspect is executed.
  • the processor and the memory are integrated together;
  • the memory is located outside the computing device.
  • the computing device also includes a communication interface, which is used for the computing device to communicate with other devices, such as sending or receiving data and/or signals.
  • the communication interface can be a transceiver, circuit, bus, module or other type of communication interface.
  • the seventh aspect provides a computer-readable storage medium, including computer-readable instructions.
  • when the computer-readable instructions are executed on a computer, the computer performs the method of the first aspect of the embodiment of the present application or of any possible implementation of the first aspect.
  • a computer program product comprising computer-readable instructions, which, when executed on a computer, enable the first aspect of the embodiment of the present application or any possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of an architecture of a computing device provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a method for processing hardware fault reporting provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is another schematic diagram of an application scenario provided in an embodiment of the present application.
  • the embodiment of the present application provides a method for processing hardware fault reporting and related equipment, which are applied in the field of hardware.
  • the method for processing hardware fault reporting can minimize the impact of fault interruption reporting on normal business when there are too many CE faults, and can be applied to fault diagnosis systems with different capabilities and various application scenarios.
  • the embodiments of the present application involve a lot of relevant knowledge about CE failures.
  • the relevant terms and concepts that may be involved in the embodiments of the present application are first introduced below.
  • Fault interrupt: when a CE fault occurs in the hardware structure of a computing device, a signal, namely a fault interrupt, is sent to the BIOS or OS to inform it that a CE fault has occurred in the hardware structure. After receiving the fault interrupt, the BIOS senses that a CE fault has occurred in the hardware structure and obtains the relevant fault information of the CE fault from the hardware structure.
  • CE storm: a certain number of CE failures occurring within a certain period of time.
  • CE storm suppression: a CE storm processing mechanism in a computer system. Specifically, after a CE storm occurs, the fault interrupt is disabled and interrupts are no longer reported to the BIOS.
  • CE storm suppression release: a CE storm processing mechanism in a computer system. Specifically, after CE storm suppression has lasted for a certain period, the fault interrupt is re-enabled, so that a fault interrupt is again sent when a CE fault occurs; that is, the suppression is released.
  • the fault diagnosis system can identify the hardware structure that may have uncorrectable faults in advance through the fault information of the CE faults obtained, and then replace these hardware structures to ensure the normal operation of the computing equipment.
  • the fault diagnosis system can be an in-band OS system or an out-of-band baseboard management controller (BMC) system, which is not limited here.
  • the hardware structure where the fault occurs can repair the fault by itself and send a fault interrupt to the BIOS to notify the BIOS that a CE fault has occurred.
  • the BIOS then obtains relevant fault information from the hardware structure where the CE fault occurs, such as the fault address, fault status, etc.
  • when the BIOS receives the fault interrupt, obtaining the fault information from the hardware structure takes priority over the normal business that the OS is running, so that normal business is suspended during this time.
  • the occurrence of CE storm and whether to perform CE storm suppression and release are determined by four threshold data in BIOS.
  • the fixed values of the four threshold data in BIOS are set to x, y, z and w, respectively, where x is the threshold of the time interval between two adjacent CE failures, that is, if the time interval between two CE failures is not greater than x, then the two CE failures are determined to be continuous CE failures.
  • y is the threshold of the number of continuous CE failures, that is, while the time interval between two adjacent CE failures is less than or equal to x, the number of CE failures is continuously accumulated.
  • when the accumulated number reaches y, the BIOS determines that a CE storm has occurred and executes CE storm suppression, that is, it no longer sends interrupts to report CE failures.
  • z is the threshold of the time interval for releasing CE storm suppression, that is, when CE storm suppression has been in effect for z, the BIOS releases it and continues to send interrupts to report CE failures, so that the fault diagnosis system can obtain CE data.
  • the BIOS performs storm suppression and notifies the independent processing unit to start timing; when the timing reaches z, the independent processing unit notifies the BIOS to release the suppression.
  • w is the threshold of the number of times CE storm suppression and release are repeatedly performed. That is, if the number of times the BIOS has repeatedly suppressed and released the storm reaches w, the BIOS permanently suppresses the CE storm and no longer sends interrupts to report CE faults, so the fault diagnosis system no longer obtains CE data.
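  • The role of w can be illustrated with a minimal sketch. The class `SuppressionCycleCounter` and its method names are hypothetical and do not come from the application. The sketch assumes that a cycle is counted each time a release is attempted, and that reporting stays off for good once the count reaches w; other counting conventions are equally compatible with the text.

```python
import math


class SuppressionCycleCounter:
    """Hypothetical model of the w threshold: counts suppress/release cycles."""

    def __init__(self, w: float):
        self.w = w                # allowed number of suppress/release cycles
        self.cycles = 0           # cycles counted so far
        self.suppressed = False   # True while CE interrupt reporting is off
        self.permanent = False    # True once reporting is off for good

    def suppress(self) -> None:
        """A CE storm was detected: turn CE interrupt reporting off."""
        self.suppressed = True

    def release(self) -> bool:
        """Suppression has lasted z: try to turn reporting back on.

        Returns True if reporting is on afterwards, False if it stays off.
        Assumption: the cycle is counted at release time.
        """
        if self.permanent:
            return False
        if not self.suppressed:
            return True           # nothing to release; reporting is already on
        self.cycles += 1
        if self.cycles >= self.w:
            self.permanent = True  # w reached: keep reporting off permanently
            return False
        self.suppressed = False
        return True


# w = math.inf models "w as large as possible": suppression is never permanent.
never_permanent = SuppressionCycleCounter(math.inf)
```

  • With w = 2 in this sketch, the first release re-enables reporting and the second one leaves it disabled permanently.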
  • the embodiments of the present application provide a method for processing hardware fault reporting and related devices thereof, which are applied in the field of hardware.
  • the computing device obtains at least a first threshold and a second threshold through the algorithm of an independent processing unit, and the first threshold and the second threshold are stored in the independent processing unit. Then the computing device determines the occurrence of continuous correctable error CE based on the first threshold, and the computing device accumulates the number of continuous CEs, and stops CE interrupt reporting based on the number of continuous CEs and the second threshold.
  • the CE interrupt is used to notify the occurrence of a CE.
  • the computing device obtains the first threshold and the second threshold through the algorithm of the independent processing unit, which can support real-time modification of each threshold data, and can minimize the impact of fault interrupt reporting on normal business when there are too many CE faults. At the same time, it can be applied to more application environments, increasing the scope of application of this solution.
  • Figure 1 is a schematic diagram of the architecture of a computing device provided by the embodiment of the present application, wherein the hardware of the computing device includes:
  • the computing device also includes BMC chip 105, which stores the BMC system.
  • when a hardware structure in the computing device fails, a fault interrupt is sent to inform the BIOS. The BIOS then collects the fault information and reports it to the fault diagnosis system, i.e., the OS and/or the out-of-band BMC system, which performs fault diagnosis based on the fault information so that the faulty hardware structure can be replaced to ensure the normal operation of the computing device.
  • the BIOS can collect fault logs from faults occurring in the hardware structure of the computing device, and perform CE storm determination/suppression/removal.
  • the OS mainly includes the operation of major services and processes, and can also be used as a fault diagnosis system to obtain fault information from the BIOS for fault analysis and fault diagnosis.
  • the BMC can be used as an out-of-band fault diagnosis system to obtain fault information from the BIOS for fault analysis and fault diagnosis.
  • the BMC can also implement CE storm control.
  • the CE fault hardware structure can repair itself, and the hardware structure sends a CE interrupt, i.e., a fault interrupt, to notify the BIOS of a CE fault.
  • a CE interrupt is an interrupt used to notify that a CE fault has occurred in the hardware structure, and the BIOS sends a CE interrupt based on the CE.
  • the system continuously obtains fault information and reports it to the OS or BMC for fault diagnosis and other operations.
  • the hardware structure may send a CE interrupt to the OS to notify the OS of the CE failure in the hardware structure, wherein the OS has the same function of obtaining a CE interrupt as the BIOS.
  • the hardware structure may also send a CE interrupt to the BMC to notify the BMC that the CE failure has occurred in the hardware structure, wherein the BMC also has the aforementioned function of obtaining a CE interrupt.
  • the computing device also includes an independent processing unit 104.
  • the independent processing unit 104 is used to obtain threshold data such as a first threshold, a second threshold, a third threshold and/or a fourth threshold corresponding to the scenario strategy.
  • the independent processing unit 104 can obtain various threshold data based on the CPU occupancy rate and the ability or demand of the fault diagnosis system to obtain fault information, or the independent processing unit 104 can obtain various threshold data defined by the user based on the scenario strategy, as shown in the following various method embodiments, and will not be described in detail here.
  • the BIOS running in the CPU 101 obtains the first threshold and the second threshold from the independent processing unit 104, and determines whether a continuous CE failure occurs based on the first threshold, and accumulates the number of continuous CE failures. When the accumulated number reaches the second threshold, it is determined that a CE storm occurs, and the reporting of the CE interrupt is stopped.
  • the independent processing unit 104 is also used to continue reporting the CE interrupt based on the third threshold and the duration of the CE interrupt stop reporting.
  • each threshold data is obtained by an independent processing unit.
  • when the BIOS receives the CE interrupt, it obtains the first threshold and the second threshold from the independent processing unit.
  • when the BIOS stops the CE interrupt reporting, it notifies the independent processing unit to time the period for which CE interrupt reporting is stopped.
  • when the third threshold is reached, the CE storm suppression is released, that is, CE interrupt reporting is turned back on.
  • CE storm suppression can avoid the impact of excessive CE interrupt reporting on the normal business of the OS, avoid the jamming or downtime of the normal business operation, and enable the normal business of the OS to run normally.
  • each threshold data is obtained dynamically and in real time by the independent processing unit based on the scenario strategy, and can adapt in real time according to the change of the application scenario, and thus can adapt to more application scenarios.
  • the computing device shown in Figure 1 can be a server, a personal computer, a computer, a cluster server, a vehicle-mounted computing device, a tablet, a storage system, or the like. It can be understood that in actual application scenarios, it can also be other computing devices, which are not specifically limited here.
  • FIG. 2 is a schematic diagram of a method for processing a hardware fault report provided by the embodiment of the present application, specifically including:
  • 201. An independent processing unit obtains threshold data (x, y, z) corresponding to a scenario strategy.
  • x is the threshold of the time interval between two adjacent CE failures
  • y is the threshold of the number of consecutive CE failures
  • z is the threshold of the time interval for releasing CE storm suppression.
  • y can be increased within a certain range; or, if the current computing device has a high business intensity, that is, its occupancy (utilization) rate is high and too many CE interrupts can easily affect the normal operation of the business, the value of y can be appropriately reduced. The specific value is not limited here.
  • the independent processing unit may be implemented by software of a computing device.
  • the BMC shown in FIG. 1 above may be used as an independent processing unit, and may obtain threshold data (x, y, z) corresponding to a scenario strategy, wherein the threshold data (x, y, z) may be modified by a user in real time according to the current application scenario, or the threshold data (x, y, z) may be obtained by an algorithm dynamically adjusting the device parameters of the current computing device, which is not specifically limited here.
  • Obtaining the corresponding threshold data in real time according to the application scenario strategy can adapt to a wider range of application scenarios.
  • the specific implementation method of obtaining the threshold data (x, y, z) is described in subsequent examples, and will not be described in detail here.
  • the independent processing unit also obtains threshold data w corresponding to the scenario strategy, where w is a permanent suppression flag.
  • the independent processing unit also obtains threshold data w, and CE storm suppression can be implemented permanently based on threshold data w.
  • w can be adjusted as large as possible, such as infinity.
  • permanent suppression can also be implemented to avoid excessive CE interruptions from affecting the normal operation of the computing device.
  • the independent processing unit of the computing device may be an intelligent management unit (IMU) or a management engine (ME). It is understandable that the independent processing unit may also be other hardware structures, which are not specifically limited here.
  • the independent processing unit implemented in hardware may obtain the threshold data (x, y, z) input by the user according to its own data serial port, or obtain the threshold data (x, y, z) corresponding to the computing device in the current application scenario according to its own internal algorithm.
  • the independent processing unit dynamically adjusts the threshold data x to 120 seconds and y to 5 times based on the CPU occupancy rate.
  • an independent processing unit obtains corresponding threshold data (x, y) based on the CPU occupancy rate.
  • the independent processing unit can automatically generate threshold data (x, y) corresponding to the current application scenario in real time and dynamically, can adapt to the application scenario in real time, and improve the flexibility of the solution.
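  • As a rough illustration of deriving (x, y) from CPU occupancy, the sketch below uses a simple step policy. The function name `thresholds_from_cpu`, the occupancy breakpoints (0.5 and 0.8), and the y values 10 and 20 are assumptions made for the example; only the pair x = 120 seconds, y = 5 appears in the text, and assigning it to the high-load case is also an assumption.

```python
from typing import Tuple


def thresholds_from_cpu(occupancy: float) -> Tuple[float, int]:
    """Return (x, y) for a CPU occupancy given as a fraction in [0.0, 1.0].

    Hypothetical policy: breakpoints and most values are illustrative only.
    x is the continuity interval in seconds, y the storm count threshold.
    """
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be between 0.0 and 1.0")
    if occupancy >= 0.8:
        # Busy system: a smaller y suppresses storms sooner, so CE
        # interrupts interfere less with normal business.
        return 120.0, 5
    if occupancy >= 0.5:
        return 120.0, 10
    # Lightly loaded system: a larger y lets the fault diagnosis system
    # receive more CE data before suppression starts.
    return 120.0, 20
```

  • The only property the sketch relies on is that y does not increase as occupancy rises, which matches the statement that y can be reduced when business intensity is high.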
  • an independent processing unit dynamically adjusts the threshold data (z, w) based on the fault diagnosis system's ability to handle the fault threshold, which can adapt to the needs of fault diagnosis systems with different capabilities, for example, it can support the fault diagnosis system to obtain more fault information.
  • the independent processing unit obtains the threshold data (x, y, z) and stores it in its own storage area. Specifically, as shown in step 202:
  • 202. The independent processing unit obtains the threshold data (x, y, z) and stores it in its own storage area.
  • the threshold data (x, y, z) can be read from the storage area of the independent processing unit to implement subsequent operations.
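  • The storage area can be pictured as a small store that is updated at run time and read by the BIOS, BMC or OS. The sketch below is an assumption-laden stand-in: `ThresholdStore`, `update` and `read` are hypothetical names, the lock merely models consistent reads during real-time modification, and the initial values in the usage note are illustrative.

```python
import threading


class ThresholdStore:
    """Hypothetical stand-in for the independent processing unit's storage area."""

    def __init__(self, x: float, y: int, z: float):
        self._lock = threading.Lock()
        self._data = {"x": x, "y": y, "z": z}

    def update(self, **changes: float) -> None:
        """Modify thresholds at run time (user interface or internal algorithm)."""
        unknown = set(changes) - set(self._data)
        if unknown:
            raise KeyError(f"unknown thresholds: {sorted(unknown)}")
        with self._lock:
            self._data.update(changes)

    def read(self, *names: str) -> tuple:
        """Return a consistent snapshot of the requested thresholds."""
        with self._lock:
            return tuple(self._data[n] for n in names)
```

  • For example, a store created as `ThresholdStore(x=120.0, y=5, z=600.0)` could be read by the BIOS with `read("x", "y")` and by the timing logic with `read("z")`, while `update(y=3)` takes effect on the next read.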
  • the BIOS or BMC or OS of the computing device may obtain threshold data (x, y, z) from an independent processing unit to determine whether a CE storm occurs, and when it is determined that a CE storm occurs, perform CE storm suppression, and release CE storm suppression, as described in the following steps:
  • 203. A CE failure occurs in the hardware structure, and a CE interrupt is sent.
  • when a CE failure occurs in the hardware structure of the computing device, the hardware structure sends a CE interrupt to the BIOS, so that the BIOS can sense that the CE failure has occurred in the hardware structure.
  • the hardware structure may also send a CE interrupt to the OS or BMC to notify the OS or BMC that a CE failure has occurred in the hardware structure.
  • the OS or BMC supports obtaining a CE interrupt and modifying a switch of the CE interrupt.
  • the BIOS determines whether a CE storm occurs based on the CE interrupt and takes corresponding measures. For details, please refer to the following steps 204 to 208:
  • 204. The BIOS obtains threshold data (x, y).
  • the BIOS obtains the threshold data (x, y) from the independent processing unit.
  • the BIOS may obtain the threshold data (x, y) from a storage area of the independent processing unit.
  • 205. The BIOS determines a CE storm based on (x, y).
  • if the time interval between two CE interrupts is not greater than x, the two CE interrupts are determined to be continuous CE interrupts, and the number of continuous CE interrupts is recorded.
  • when the number of continuous CE interrupts reaches y, it is determined that a CE storm has occurred.
  • if the number of continuous CE interrupts is less than y, it is determined that no CE storm has occurred.
  • if the number of continuous CE interrupts has not reached y and the time interval between two CE interrupts is greater than x, the recorded number of continuous CE interrupts is cleared.
  • BIOS performs CE storm suppression.
  • After the BIOS determines that a CE storm has occurred, it performs CE storm suppression.
  • For example, after determining that a CE storm has occurred, the BIOS turns off the switch for reporting CE interrupts, prohibiting the hardware structure of the computing device from sending CE interrupts to the BIOS, thereby preventing the continuous reporting of CE interrupts from affecting the normal operation of services.
  • When the BIOS performs CE storm suppression, it also notifies the independent processing unit that suppression has been executed, so that the independent processing unit times the suppression based on the threshold data z. Specifically, as described in step 207:
  • the independent processing unit performs CE storm suppression timing.
  • When the BIOS performs CE storm suppression, the independent processing unit obtains the threshold data z and times the storm release. For example, the independent processing unit can count z down and, when the count reaches 0, send a CE storm suppression release task to the BIOS; or it can start a timer and, when the timed value reaches the threshold data z, send a CE storm suppression release task to the BIOS. In practice, the independent processing unit may time CE storm suppression based on z in other ways, which are not specifically limited here.
  • an independent processing unit performs CE storm suppression release time timing based on threshold data z, which can effectively release CE storm suppression and restore CE interrupt reporting, so that BIOS can continue to obtain fault information and ensure real-time management of computing device faults as much as possible.
  • BIOS releases CE storm suppression.
  • When the independent processing unit determines, based on the threshold data z, that CE storm suppression has lasted for z, the BIOS receives the CE storm suppression release task issued by the independent processing unit and executes the release. After the suppression is released, when a CE fault occurs, the hardware structure can again send a CE interrupt to the BIOS to inform it of the fault, so that the corresponding operations can proceed normally, such as obtaining CE fault information and having the fault diagnosis system diagnose, locate and analyze the CE fault based on that information.
  • In a possible implementation, when the number of times the BIOS has executed CE storm suppression release reaches the threshold data w, the BIOS performs permanent CE storm suppression. Specifically, as described in step 209 below:
  • BIOS performs permanent CE storm suppression.
  • the independent processing unit counts the number of times the BIOS executes CE storm suppression release. When the count value reaches the threshold data w, the permanent suppression flag w is valid, and the independent processing unit will not issue the CE storm suppression release task to the BIOS, thereby achieving permanent CE storm suppression.
  • the threshold data obtained by the independent processing unit is used to suppress and release the CE fault storm, and the threshold can be changed in real time according to the application environment. It is suitable for more application scenarios, and the threshold data can be adaptively modified to improve the flexibility of the solution, which is suitable for fault diagnosis systems with different working capabilities.
  • FIG3 is a schematic diagram of an application scenario provided by the embodiment of the present application.
  • The computing device is described by taking a server with an x86 CPU as an example. Since an x86 CPU contains no processing unit that is independent of the main CPU, the BMC serves as the independent processing unit in FIG. 3: it obtains the threshold data and performs operations such as CE storm suppression timing. Because the BMC provides out-of-band management, the user can modify the threshold data through out-of-band interaction, and the BIOS reads the threshold data from the BMC to determine the CE storm and perform CE storm suppression and release. The specific implementation process is as follows:
  • the BMC obtains threshold data through a web page.
  • a user can input threshold data (x, y, z and/or w) in the BMC based on out-of-band interaction through a web page, i.e., an input interface or an input program, and the specific value of the threshold data (x, y, z and/or w) can be modified by the user in real time based on the application environment, which is not limited here.
  • the user can adjust the threshold data according to the tolerance of the computing device system to the CE storm, or the BMC can dynamically adjust the threshold data based on the change in the demand for CE fault information obtained by the fault diagnosis system (i.e., BMC or OS), so as to implement the present solution for a variety of application scenarios.
  • When the hardware structure detects that a CE fault has occurred in itself, it repairs the CE fault and executes the following step 303 to send a CE interrupt to the BIOS.
  • the hardware structure sends a CE interrupt to the BIOS.
  • When the hardware structure detects that a CE fault has occurred in itself, it sends a CE interrupt to the BIOS, so that the BIOS becomes aware that a CE fault has occurred.
  • BIOS obtains threshold data from BMC.
  • the BIOS reads the threshold data (x, y) from the BMC, and then executes step 305 to determine the CE storm based on the threshold data.
  • BIOS determines CE storm.
  • the BIOS determines the CE storm based on the read threshold data (x, y). Specifically, when the reporting time interval of two adjacent CE interrupts is less than x, the two CE interrupts are determined to be continuous CE interrupts, and the continuous CE interrupts are counted. When the time interval between two adjacent CE interrupts is greater than x, the count of continuous CE interrupts is cleared, and when the count value of continuous CE interrupts reaches the threshold data y, it is determined that a CE storm is currently generated. Then the BIOS executes the following step 306 to suppress the CE storm.
  • BIOS performs CE storm suppression.
  • When the BIOS determines that a CE storm is currently occurring, it performs CE storm suppression. Specifically, the BIOS turns off the switch for reporting CE interrupts and prohibits their reporting, so that continuous CE interrupts neither affect the normal operation of services nor cause services to stall or go down.
  • BIOS reports suppression events to BMC.
  • When the BIOS executes CE storm suppression, it also reports to the BMC that CE storm suppression has started, so that the BMC performs the following step 308:
  • BMC performs suppression timing.
  • the BMC obtains the threshold data z and starts the suppression timing.
  • the BMC can execute a decrement count of z, and execute step 309 when it decrements to 0; or start counting from 0, and execute step 309 when the timing reaches the threshold data z.
  • the specifics are not limited here.
  • BMC issues the CE storm suppression release task.
  • the BMC sends a CE storm suppression release task to the BIOS.
  • BIOS executes CE storm suppression release.
  • After receiving the CE storm suppression release task sent by the BMC, the BIOS executes the CE storm suppression release. For example, the BIOS turns on the switch for reporting CE interrupts. When a CE fault occurs, the BIOS again receives the CE interrupt reported by the hardware structure and becomes aware of the CE fault, and may also obtain the CE fault information and send it to the fault diagnosis system.
  • the BMC obtains the threshold data w from the user, and when the number of times the BMC sends the CE storm suppression release reaches the threshold data w, the threshold data w is triggered, that is, the permanent suppression flag is valid.
  • the BIOS implements permanent CE storm suppression, that is, the switch for reporting CE interrupts is no longer turned on, and CE interrupts are never obtained. This can reduce the workload of the computing device and avoid high probability of executing CE storm suppression and CE storm suppression release.
  • the BMC in the above FIG. 3 is an independent processing unit, and is an example of implementing an independent processing unit in the form of software to illustrate the specific implementation of the independent processing unit. It is only used as an example to understand the embodiments of the present application, and does not substantially limit the embodiments of the present application. It can be understood that in actual situations, other software can also be used to implement the independent processing unit, which is not specifically limited here.
  • FIG 3 illustrates the application scenario of the software BMC as an independent processing unit.
  • the following is an example of the application scenario of the hardware IMU as an independent processing unit. Please refer to Figure 4 for details.
  • Figure 4 is another schematic diagram of the application scenario provided in an embodiment of the present application.
  • the computing device is a server including a reduced instruction set machine (ARM) CPU as an example. Since the ARM CPU has an IMU or ME hardware unit that is independent of the main CPU, the IMU or ME can obtain threshold data and execute storm suppression timing, and the BIOS can obtain threshold data from the IMU or ME, determine the CE storm, and execute CE storm suppression and release.
  • IMU obtains threshold data.
  • the user can input threshold data (x, y, z and/or w) into the IMU through its own serial port.
  • the specific value of the threshold data (x, y, z and/or w) can be determined by the user based on the application environment and then modified in real time through the serial port of the IMU. It can be understood that the value of the threshold data in each application environment can be determined according to actual conditions, and no specific limitation is made here.
  • the IMU can also adjust the threshold data based on the computing device's current tolerance for CE storms, for example by dynamically adjusting the threshold data based on the CPU occupancy rate as described for FIG. 2, or based on changes in the fault diagnosis system's (i.e., the BMC's or OS's) demand for CE fault information.
  • the user can adjust the threshold data according to the CE storm tolerance of the computing device system, or the BMC can dynamically adjust the threshold data based on changes in the CE fault information requirements obtained by the fault diagnosis system, thereby implementing the present solution for a variety of application scenarios.
  • When the hardware structure detects that a CE fault has occurred in itself, it repairs the CE fault and executes the following step 403 to send a CE interrupt to the BIOS.
  • the hardware structure sends a CE interrupt to the BIOS.
  • BIOS obtains threshold data from the IMU.
  • BIOS determines CE storm.
  • BIOS performs CE storm suppression.
  • BIOS reports the suppression event to BMC.
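  • When the BIOS executes CE storm suppression, it also reports that CE storm suppression has started, so that the suppression timing of step 408 is performed.
  • IMU performs suppression timing.
  • After receiving the suppression event reported by the BIOS, the IMU starts the suppression timing based on the stored threshold data z.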
  • BMC issues a CE storm suppression release task.
  • BIOS executes CE storm suppression release.
  • each functional module or unit in each embodiment of the present application may be integrated into a processor, and the above-mentioned integrated modules or units may be implemented in the form of hardware or in the form of software functional modules.
  • An embodiment of the present application also provides a computer-readable storage medium, including computer-readable instructions.
  • When the computer-readable instructions are executed on a computer, the computer executes any one of the implementations shown in the aforementioned method embodiments.
  • the embodiments of the present application also provide a computer program product, which includes a computer program or instructions.
  • When the computer program or instructions are executed on a computer, the computer executes any one of the implementations shown in the aforementioned method embodiments.
  • the embodiment of the present application also provides a chip or chip system, which may include a processor.
  • the chip may also include a memory (or storage module) and/or a transceiver (or communication module), or the chip is coupled to a memory (or storage module) and/or a transceiver (or communication module), wherein the transceiver (or communication module) can be used to support the chip for wired and/or wireless communication, and the memory (or storage module) can be used to store a program or a set of instructions, and the processor calls the program or the set of instructions to implement the above method embodiment, the operation performed by the terminal or network device in any possible implementation of the method embodiment.
  • the chip system may include the above chip, and may also include the above chip and other separate devices, such as a memory (or storage module) and/or a transceiver (or communication module).
  • the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the technical solution of the embodiments of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disc, and including a number of instructions that cause a computer device (which may be a personal computer, training equipment, network equipment, etc.) to execute the methods of each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

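A method for processing hardware fault reporting and related devices, applied in the field of hardware. The method includes: a computing device (100) obtains at least a first threshold and a second threshold through an algorithm of an independent processing unit (104), the first threshold and the second threshold being stored in the independent processing unit (104); the computing device determines, based on the first threshold, that continuous correctable errors (CEs) have occurred and accumulates the number of continuous CEs; it then stops CE interrupt reporting based on the number of continuous CEs and the second threshold, where a CE interrupt is used to announce that a CE has occurred. Because the computing device (100) obtains the first threshold and the second threshold through the algorithm of the independent processing unit, each threshold can be modified in real time; and because continuous reporting of CE interrupts is stopped based on these thresholds, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, normal services keep running, and the solution applies to more application environments.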

Description

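A method for processing hardware fault reporting and related devices
This application claims priority to Chinese Patent Application No. 202211192364.5, filed with the China Patent Office on September 28, 2022 and entitled "A method for processing hardware fault reporting and related devices", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present application relate to the field of hardware, and in particular to a method for processing hardware fault reporting and related devices.
Background
When a correctable error (CE) occurs in a hardware component of a computing device, the hardware repairs the fault itself and sends an interrupt to the basic input output system (BIOS) to notify the BIOS that a CE fault has occurred.
When many CE faults occur within a given period, the excessive interrupts affect the normal operation of operating system (OS) services.
Summary
The embodiments of the present application provide a method for processing hardware fault reporting and related devices, applied in the field of hardware. The method reduces, as far as possible, the impact of fault-interrupt reporting on normal services when CE faults are excessive, suits fault diagnosis systems of different capabilities, and applies to a variety of application scenarios.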
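In a first aspect, a method for processing hardware fault reporting is provided, including:
A computing device obtains at least a first threshold and a second threshold through an algorithm of an independent processing unit, the first threshold and the second threshold being stored in the independent processing unit;
the computing device determines, based on the first threshold, that continuous correctable errors (CEs) have occurred;
the computing device accumulates the number of continuous CEs;
the computing device stops CE interrupt reporting based on the number of continuous CEs and the second threshold, where a CE interrupt is used to announce that a CE has occurred.
In this embodiment, because the computing device obtains the first threshold and the second threshold through the algorithm of the independent processing unit, each threshold can be modified in real time; and because continuous reporting of CE interrupts is stopped based on these thresholds, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, normal services keep running, and the solution applies to more application environments.
In a possible implementation of the first aspect, the computing device obtains a third threshold through the algorithm of the independent processing unit, the third threshold being stored in the independent processing unit.
After the computing device stops CE interrupt reporting based on the number of continuous CEs and the second threshold, the computing device resumes CE interrupt reporting based on the duration for which reporting has been stopped and the third threshold.
In this embodiment, the independent processing unit also obtains the third threshold, and CE interrupt reporting can be resumed based on it, so fault data continues to be provided and the fault state of the hardware structure of the computing device can be managed continuously.
In a possible implementation of the first aspect, the computing device, through the algorithm of the independent processing unit, determines the first threshold and the second threshold in real time based on the occupancy rate of the central processing unit (CPU), and determines the third threshold based on the capability requirements of the fault diagnosis system.
In this embodiment, the independent processing unit obtains the first threshold and the second threshold from the CPU occupancy rate and determines the third threshold from the capability requirements of the fault diagnosis system, which adapts to the application scenario in real time, accommodates fault diagnosis systems of different capabilities, and improves the flexibility of the solution.
In a possible implementation of the first aspect, the computing device, through the algorithm of the independent processing unit, obtains a user-defined first threshold, second threshold and third threshold from an interface; the algorithm supports obtaining data from the interface of the independent processing unit.
In this embodiment, the computing device obtains the user-defined first, second and third thresholds from the interface through the algorithm of the independent processing unit, and the user modifies them in real time through the interface according to the current application scenario, so the thresholds can follow the scenario policy and the solution suits many application scenarios.
In a possible implementation of the first aspect, the computing device obtains a fourth threshold through the algorithm of the independent processing unit, the fourth threshold being stored in the independent processing unit.
After the computing device resumes CE interrupt reporting based on the duration for which reporting has been stopped and the third threshold, the computing device accumulates a target count, which is the number of times CE interrupt reporting has been resumed after being stopped; the computing device then permanently prohibits CE interrupt reporting based on the target count and the fourth threshold.
In this embodiment, permanently prohibiting CE interrupt reporting effectively prevents a hardware structure that is highly prone to CE faults from frequently reporting CE interrupts. Since CE faults are self-healing, this keeps the continual reporting of CE interrupts from affecting normal services, avoids endlessly repeating CE storm suppression and release, and reduces the workload of the computing device.
In a possible implementation of the first aspect, the computing device determines through the BIOS, based on the first threshold, that continuous CEs have occurred. The computing device then stops CE interrupt reporting through the BIOS based on the number of continuous CEs and the second threshold.
In this embodiment, the BIOS is given as an example of the component that stops interrupt reporting, which reflects the reliability of the solution.
In a possible implementation of the first aspect, the computing device determines through the baseboard management controller (BMC) or the OS, based on the first threshold, that continuous CEs have occurred.
The computing device also stops CE interrupt reporting through the BMC or the OS based on the number of continuous CEs and the second threshold.
In this embodiment, CE interrupt reporting can also be stopped through the BMC or the OS, which reflects the flexibility of the solution.
In a possible implementation of the first aspect, the independent processing unit is any one of the following:
an intelligent management unit (IMU), a management engine (ME), a BMC or an OS, which is not specifically limited here.
In this embodiment, several possible implementations of the independent processing unit are listed, which reflects the diversity and flexibility of the solution.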
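In a second aspect, a computing device is provided, including a CPU and an independent processing unit, where the CPU is configured to store the BIOS;
the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, the first threshold and the second threshold being stored in the independent processing unit;
the BIOS is configured to determine, based on the first threshold, that continuous correctable errors (CEs) have occurred;
the BIOS is further configured to accumulate the number of continuous CEs;
the BIOS is further configured to stop CE interrupt reporting based on the number of continuous CEs and the second threshold, where a CE interrupt is used to announce that a CE has occurred.
In this embodiment, because the independent processing unit obtains the first threshold and the second threshold through the algorithm, each threshold can be modified in real time; and because the BIOS stops continuous reporting of CE interrupts based on the first threshold and the second threshold, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, and the solution applies to more application environments.
In a possible implementation of the second aspect, the independent processing unit is further configured to obtain a third threshold through the algorithm, the third threshold being stored in the independent processing unit;
the independent processing unit is further configured to resume CE interrupt reporting based on the duration for which reporting has been stopped and the third threshold.
In this embodiment, the independent processing unit also obtains the third threshold, and CE interrupt reporting can be resumed based on it, so fault data continues to be provided and the fault state of the hardware structure of the computing device can be managed continuously.
In a possible implementation of the second aspect, the independent processing unit is specifically configured to determine, through the algorithm, the first threshold and the second threshold in real time based on the CPU occupancy rate, and to determine the third threshold based on the capability requirements of the fault diagnosis system.
In this embodiment, the independent processing unit obtains the first threshold and the second threshold from the CPU occupancy rate and determines the third threshold from the capability requirements of the fault diagnosis system, which adapts to the application scenario in real time, accommodates fault diagnosis systems of different capabilities, and improves the flexibility of the solution.
In a possible implementation of the second aspect, the independent processing unit is further configured to obtain a fourth threshold through the algorithm, the fourth threshold being stored in the independent processing unit; the independent processing unit is further configured to accumulate a target count, which is the number of times CE interrupt reporting has been resumed after being stopped, and to permanently prohibit CE interrupt reporting based on the target count and the fourth threshold.
In this embodiment, permanently prohibiting CE interrupt reporting effectively prevents a hardware structure that is highly prone to CE faults from frequently reporting CE interrupts. Since CE faults are self-healing, this keeps the continual reporting of CE interrupts from affecting normal services, avoids endlessly repeating CE storm suppression and release, and reduces the workload of the computing device.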
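In a third aspect, another computing device is provided, including a CPU, an independent processing unit and a storage chip, where the storage chip is configured to store the BIOS and the CPU is configured to run the BIOS;
the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, the first threshold and the second threshold being stored in the independent processing unit;
the BIOS is configured to determine, based on the first threshold, that continuous correctable errors (CEs) have occurred;
the BIOS is further configured to accumulate the number of continuous CEs;
the BIOS is further configured to stop CE interrupt reporting based on the number of continuous CEs and the second threshold, where a CE interrupt is used to announce that a CE has occurred.
In this embodiment, because the independent processing unit obtains the first threshold and the second threshold through the algorithm, each threshold can be modified in real time; and because the BIOS stops continuous reporting of CE interrupts based on the first threshold and the second threshold, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, and the solution applies to more application environments. Storing the BIOS in a storage chip also adds to the diversity of the solution.
In a fourth aspect, another computing device is provided, including a CPU, an independent processing unit and a BMC chip;
the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, the first threshold and the second threshold being stored in the independent processing unit;
the BMC chip is configured to determine, based on the first threshold, that continuous correctable errors (CEs) have occurred;
the BMC chip is further configured to accumulate the number of continuous CEs;
the BMC is further configured to stop CE interrupt reporting based on the number of continuous CEs and the second threshold, where a CE interrupt is used to announce that a CE has occurred.
In this embodiment, because the independent processing unit obtains the first threshold and the second threshold through the algorithm, each threshold can be modified in real time; and because the BMC stops continuous reporting of CE interrupts based on the first threshold and the second threshold, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, the solution applies to more application environments, and its flexibility is demonstrated.
In a fifth aspect, another computing device is provided, which may include a processor coupled to a memory, where the memory is configured to store instructions and the processor is configured to execute the instructions in the memory so that the computing device performs the method described in the first aspect or any possible implementation of the first aspect of the embodiments of the present application.
In a sixth aspect, another computing device is provided, including a processor configured to execute a computer program (or computer-executable instructions) stored in a memory; when the computer program (or computer-executable instructions) is executed, the method of the first aspect or any possible implementation of the first aspect is performed.
In one possible implementation, the processor and the memory are integrated together;
in another possible implementation, the memory is located outside the computing device.
The computing device further includes a communication interface used by the computing device to communicate with other devices, for example to send or receive data and/or signals. For example, the communication interface may be a transceiver, a circuit, a bus, a module or another type of communication interface.
In a seventh aspect, a computer-readable storage medium is provided, including computer-readable instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect of the embodiments of the present application.
In an eighth aspect, a computer program product is provided, including computer-readable instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect of the embodiments of the present application.
Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for processing hardware fault reporting provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 4 is another schematic diagram of an application scenario provided by an embodiment of the present application.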
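Detailed Description
The embodiments of the present application provide a method for processing hardware fault reporting and related devices, applied in the field of hardware. The method reduces, as far as possible, the impact of fault-interrupt reporting on normal services when CE faults are excessive, suits fault diagnosis systems of different capabilities, and applies to a variety of application scenarios.
The terms "first", "second" and the like in the specification, claims and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely how objects with the same attributes are distinguished when the embodiments are described. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product or device that contains a series of units is not necessarily limited to those units, but may include other units not expressly listed or inherent to such a process, method, product or device.
The embodiments of the present application involve a good deal of background on CE faults. To make the solution easier to understand, the related terms and concepts are introduced first.
Fault interrupt: when a CE fault occurs in a hardware structure of the computing device, a signal, namely the fault interrupt, is sent to the BIOS or OS to inform it that a CE fault has occurred in that hardware structure. After receiving the fault interrupt, the BIOS becomes aware of the CE fault and obtains the related fault information from the hardware structure.
CE storm: a certain number of CE faults occurring within a certain period.
CE storm suppression: a mechanism in a computer system for handling a CE storm. Specifically, after a CE storm occurs, fault interrupts are turned off so that interrupts are no longer reported to the BIOS.
CE storm suppression release: a mechanism in a computer system for handling a CE storm. Specifically, after the CE storm has been suppressed for a certain time, fault interrupts are turned back on, so that a fault interrupt is again sent when a CE fault occurs.
Independent processing unit: a processing unit that is independent of the main processor, such as the CPU, is unaffected by the tasks on the main processor, and can be used to monitor the main processor.
Fault diagnosis system: CE faults that a hardware structure keeps producing may accumulate into uncorrectable faults. Using the fault information of the CE faults it obtains, the fault diagnosis system can identify in advance the hardware structures likely to develop uncorrectable faults so that they can be replaced, keeping the services of the computing device running normally. For example, the fault diagnosis system may be an in-band OS system or an out-of-band baseboard management controller (BMC) system, which is not specifically limited here.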
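Before the embodiments of the present application are introduced, the current handling of hard-failure CEs in memory mirroring mode is briefly described to aid understanding.
When a CE fault occurs in the central processing unit (CPU), memory, a peripheral component interconnect express (PCIE) device or another hardware component of a computing device, the faulty hardware structure can repair the fault itself and sends a fault interrupt to the BIOS to notify it that a CE fault has occurred. The BIOS then obtains the related fault information, such as the fault address and fault status, from the faulty hardware structure. Once the BIOS receives a fault interrupt, obtaining fault information from the hardware structure takes priority over serving the normal services run by the OS, so those services are paused. When CE faults are too frequent, the excessive fault interrupts therefore pause normal services for too long and too often, causing them to stall or go down. For this reason, fault-interrupt reporting needs to be suppressed when CE faults are found to be excessive. However, once fault interrupts are turned off, the fault diagnosis system used to locate and analyze faults can no longer see the CE data, that is, the CE fault information; it cannot use that data to identify hardware structures that may develop uncorrectable faults, so the faulty hardware cannot be replaced. Fault interrupts therefore need to be turned on again.
In some embodiments, four threshold values in the BIOS determine whether a CE storm has occurred and whether CE storm suppression and release are performed. For example, the four thresholds in the BIOS are set to fixed values x, y, z and w. x is the threshold for the time interval between two adjacent CE faults: if two CE faults occur no more than x apart, they are determined to be continuous CE faults. y is the threshold for the number of continuous CE faults: while the interval between adjacent CE faults is less than or equal to x, the CE fault count keeps accumulating; when the interval between two adjacent CE faults is greater than x, the accumulated count is cleared; and when the accumulated count exceeds y, the BIOS determines that a CE storm has occurred and performs CE storm suppression, that is, interrupts are no longer sent to report CE faults. z is the threshold for the time after which CE storm suppression is released: once suppression has lasted for z, the BIOS releases it and interrupts are again sent to report CE faults, so that the fault diagnosis system obtains CE data.
For example, the BIOS performs storm suppression and notifies the independent processing unit to start timing; when the timer reaches z, the independent processing unit notifies the BIOS to release the suppression. w is the threshold for the number of times CE storm suppression and release are repeated: once the BIOS has suppressed and released w times, it suppresses the CE storm permanently and interrupts are never again sent to report CE faults, that is, the fault diagnosis system no longer obtains CE data.
However, these threshold values in the BIOS are fixed, and the BIOS does not support changing them. Different hardware systems differ in performance, in the application scenarios they suit and in their tolerance for CE faults, so the thresholds cannot be adjusted to fit more hardware systems. Nor can the mechanism accommodate fault diagnosis systems of different capabilities: for example, when an upgraded fault diagnosis system needs to collect more CE data but the relevant thresholds (such as z and w) cannot be modified, the mechanism no longer fits, and the application scenario is limited. In addition, because the BIOS chip is coupled to the independent processing unit, a version update requires the coupling to the independent processing unit to be modified in step, which entails considerable maintenance work.
To solve the above problems, the embodiments of the present application provide a method for processing hardware fault reporting and related devices, applied in the field of hardware. The computing device obtains at least a first threshold and a second threshold through an algorithm of an independent processing unit, the first threshold and the second threshold being stored in the independent processing unit; the computing device then determines, based on the first threshold, that continuous correctable errors (CEs) have occurred, accumulates the number of continuous CEs, and stops CE interrupt reporting based on that number and the second threshold, where the CE interrupt is used to announce a CE. Because the thresholds are obtained through the algorithm of the independent processing unit, each threshold can be modified in real time, the impact of fault-interrupt reporting on normal services when CE faults are excessive is reduced as far as possible, and the solution applies to more application environments.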
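First, to aid understanding of the later embodiments, the architecture of a computing device to which the embodiments apply is briefly described. Refer to FIG. 1, a schematic architecture diagram of a computing device provided by an embodiment of the present application, in which the hardware of the computing device includes:
a CPU 101, a storage chip 103 and a memory 102, where the memory 102 stores data, the OS can run on the CPU 101, and the BIOS is stored in the CPU 101 or in the storage chip 103 and is run by the CPU 101. In a possible implementation, the computing device further includes a BMC chip 105, which stores the BMC system. When a hardware component of the computing device fails, a fault interrupt is sent to notify the BIOS; the BIOS then collects the fault information and reports it to the fault diagnosis system, that is, the OS and/or the out-of-band BMC system, which diagnoses the fault from that information so that the faulty hardware structure can be replaced and the computing device keeps running normally.
Specifically, the BIOS can collect fault logs for faults occurring in the hardware structure of the computing device and perform CE storm determination, suppression and release. The OS mainly runs the main services and processes, and can also act as a fault diagnosis system that obtains fault information from the BIOS for fault analysis and diagnosis. The BMC can act as an out-of-band fault diagnosis system that obtains fault information from the BIOS for fault analysis and diagnosis; optionally, the BMC can also implement CE storm control.
For example, when a CE fault occurs in a hardware component of the computing device such as the CPU, memory or a PCIE device, the hardware structure can repair the CE fault itself and sends a CE interrupt, that is, a fault interrupt, to notify the BIOS that a CE fault has occurred. The CE interrupt is the interrupt used to announce that a CE fault has occurred in the hardware structure; based on it, the BIOS obtains the fault information and reports it to the OS or BMC for operations such as fault diagnosis.
In a possible implementation, when a CE fault occurs in a hardware structure of the computing device, the hardware structure may send a CE interrupt to the OS to notify it of the CE fault, where the OS has the same capability as the BIOS to obtain CE interrupts.
In a possible implementation, when a CE fault occurs in a hardware structure of the computing device, the hardware structure may also send a CE interrupt to the BMC to notify it of the CE fault, where the BMC likewise has the aforementioned capability of the BIOS to obtain CE interrupts.
The computing device further includes an independent processing unit 104, which is configured to obtain threshold data corresponding to the scenario policy, such as the first threshold, the second threshold, the third threshold and/or the fourth threshold. In a possible implementation, the independent processing unit 104 may derive each threshold from the CPU occupancy rate and from the capability or demand of the fault diagnosis system for fault information, or it may obtain thresholds defined by the user according to the scenario policy, as shown in the method embodiments below and not repeated here.
Specifically, after the BIOS running on the CPU 101 receives a CE interrupt, it obtains the first threshold and the second threshold from the independent processing unit 104, determines based on the first threshold whether continuous CE faults have occurred, and accumulates the number of continuous CE faults; when the accumulated count reaches the second threshold, it determines that a CE storm has occurred and stops CE interrupt reporting. The independent processing unit 104 is further configured to resume CE interrupt reporting based on the third threshold and the duration for which CE interrupt reporting has been stopped.
In this embodiment, each threshold is obtained by the independent processing unit. When a CE fault occurs in the hardware structure and the BIOS receives the CE interrupt, the BIOS obtains the first threshold and the second threshold from the independent processing unit; when the number of continuous CE faults reaches the second threshold, it determines that a CE storm has occurred and stops CE interrupt reporting to suppress the storm. When the BIOS stops CE interrupt reporting, it notifies the independent processing unit to time how long reporting has been stopped; when that time reaches the third threshold, CE storm suppression is released, that is, CE interrupt reporting is turned back on. Suppressing a detected CE storm thus keeps excessive CE interrupts from affecting the normal services run by the OS and prevents those services from stalling or going down. Because the independent processing unit obtains each threshold dynamically and in real time according to the scenario policy, the solution can follow changes in the application scenario and therefore suits more application scenarios.
It should be noted that the computing device shown in FIG. 1 may be a server, a personal computer, a computer, a cluster server, an in-vehicle computing device, a tablet, a storage system or another computing device; in practical application scenarios it may also be some other computing device, which is not specifically limited here.
Taking the computing device of FIG. 1 as an example, the method for processing hardware fault reporting provided by the embodiments of the present application is described in detail below with reference to the drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided by the embodiments apply equally to similar technical problems. Refer to FIG. 2, a schematic diagram of the method for processing hardware fault reporting provided by an embodiment of the present application, which specifically includes: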
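201. The independent processing unit obtains threshold data (x, y, z) corresponding to the scenario policy.
Different application scenarios, or a change in the capability of the fault diagnosis system in the current scenario, call for different threshold data (x, y, z), where x is the threshold for the time interval between two adjacent CE faults, y is the threshold for the number of continuous CE faults, and z is the threshold for the time after which CE storm suppression is released, as described above and not repeated here. For example, when the fault diagnosis system needs more CE fault information, or when the hardware structure is more capable, the value of y can be increased within a certain range; when the computing device is under heavy service load, that is, its occupancy rate is high and excessive CE interrupts readily affect normal services, the value of y can be lowered appropriately. This is not specifically limited here.
In a possible implementation, the independent processing unit may be implemented in software on the computing device. For example, the BMC shown in FIG. 1 can serve as the independent processing unit and obtain the threshold data (x, y, z) corresponding to the scenario policy. The threshold data (x, y, z) may be modified in real time by the user according to the current application scenario, or may be adjusted dynamically by an algorithm according to the current device parameters of the computing device, which is not specifically limited here. Obtaining each threshold in real time according to the scenario policy allows the solution to suit many application scenarios. How the threshold data (x, y, z) is obtained is illustrated in the later examples and not repeated here.
In a possible implementation, the independent processing unit also obtains threshold data w corresponding to the scenario policy, where w is the permanent suppression flag. For example, when a hardware structure frequently produces CE storms and storm suppression and release have been performed many times, once the number of releases reaches w the permanent suppression flag w becomes valid; when a CE storm then occurs again and is suppressed, no further CE storm suppression release task is issued, and the suppression remains in effect permanently. By also obtaining w, the independent processing unit can make CE storm suppression permanent. When the hardware structure producing the CE faults does not affect the normal services of the computing device, permanent suppression through w isolates the CE interrupts it reports more effectively, completely avoiding their impact on normal services and keeping OS services running normally.
In addition, for example, when a hardware structure is important and its state needs to be watched in real time, permanent suppression is not needed and w can be set as large as possible, for example to infinity; conversely, when the CE faults of a hardware structure do not greatly affect the computing device in the current application scenario, permanent suppression can be applied to keep excessive CE interrupts from affecting normal services.
In addition, in a possible implementation, as shown in FIG. 1, the independent processing unit of the computing device may be an intelligent management unit (IMU) or a management engine (ME); it may also be another hardware structure, which is not specifically limited here. A hardware-implemented independent processing unit can obtain the threshold data (x, y, z) entered by the user through its own data serial port, or derive the threshold data (x, y, z) for the computing device in the current application scenario using its own internal algorithm.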
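For example, the independent processing unit dynamically adjusts the threshold data (x, y) based on the CPU occupancy rate. Suppose that at a CPU occupancy rate of 50%, x is fixed at 60 seconds and y at 10 times; for every 10% increase in occupancy the independent processing unit sets x = x*2 and y = y/2, and for every 10% decrease it sets x = x/2 and y = y*2. Specifically, when the CPU occupancy rate drops to 40%, x = 60 seconds / 2 = 30 seconds and y = 10 * 2 = 20 times, that is, the independent processing unit dynamically adjusts the threshold data to x = 30 seconds and y = 20 times. When the CPU occupancy rate rises from 50% to 60%, x = 60 seconds * 2 = 120 seconds and y = 10 / 2 = 5 times, that is, the threshold data is adjusted to x = 120 seconds and y = 5 times.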
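The occupancy-based rule above can be sketched as a small function. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, occupancy changes are counted in whole 10% steps from the 50% baseline, and y is truncated to an integer of at least 1, none of which the text specifies.

```python
def thresholds_for_cpu(occupancy_pct, base_pct=50, step_pct=10,
                       base_x=60.0, base_y=10):
    """Return (x_seconds, y_count) for a CPU occupancy percentage.

    Hypothetical helper: the baseline (50% -> x = 60 s, y = 10) and the
    doubling/halving rule come from the example in the text; the rounding
    choices below are assumptions of this sketch.
    """
    # Whole 10% steps away from the baseline; int() truncates toward zero,
    # so 45% or 55% still counts as zero steps (assumption).
    steps = int((occupancy_pct - base_pct) / step_pct)
    factor = 2.0 ** steps
    x_seconds = base_x * factor              # x doubles per +10%, halves per -10%
    y_count = max(1, int(base_y / factor))   # y halves per +10%, doubles per -10%
    return x_seconds, y_count


print(thresholds_for_cpu(40))  # one step below the baseline
print(thresholds_for_cpu(60))  # one step above the baseline
```

The two calls reproduce the worked values in the paragraph above (30 seconds and 20 times at 40%, 120 seconds and 5 times at 60%).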
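In this embodiment, the independent processing unit obtains threshold data (x, y) derived from the CPU occupancy rate and can automatically generate, dynamically and in real time, threshold data (x, y) matching the current application scenario, which adapts to the scenario in real time and improves the flexibility of the solution.
In addition, in a possible implementation, the independent processing unit may also dynamically adjust the threshold data (z, w) based on the fault-prediction capability of the fault diagnosis system. For example, after the fault diagnosis system is upgraded or updated, its demand for fault information changes, and the independent processing unit determines the threshold data (z, w) from the current demand. For example, with the capability of the fault diagnosis system unchanged, during the first hour after the computing device starts running, z = 10 minutes and w = 30 times; for each further hour, z = z*2, that is, z = 20 minutes during the second hour, and so on as time passes; if no CE fault interrupt is received for three days, z = 10 minutes and w = 30 times are restored. When the fault diagnosis system is upgraded and needs more CE fault information, the independent processing unit adjusts (z, w) accordingly: for example, during the first hour after startup, z is fixed at 10 minutes and w is infinite; for each further hour, z = z*2; if no CE fault interrupt is received for three days, z = 10 minutes and w = infinity are restored.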
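As a rough sketch of the (z, w) schedule in this example, the function below maps running time to a release interval and a permanent-suppression limit. The function name and parameters are hypothetical; the sketch also assumes that hours are counted from startup and that a quiet period of three days simply returns the base value, details the text leaves open.

```python
import math


def release_thresholds(hours_since_start, days_without_ce=0.0,
                       diagnosis_wants_more_data=False,
                       base_z_min=10, base_w=30, reset_days=3):
    """Return (z_minutes, w) following the example schedule in the text.

    Hypothetical helper; how the clock restarts after a reset is an
    assumption of this sketch, not something the source specifies.
    """
    if days_without_ce >= reset_days:
        # Three quiet days: fall back to the initial release interval.
        z_minutes = base_z_min
    else:
        # z doubles for every full hour of running time after the first.
        z_minutes = base_z_min * 2 ** int(hours_since_start)
    # An upgraded diagnosis system that wants more CE data never reaches
    # permanent suppression, modelled here as w = infinity.
    w = math.inf if diagnosis_wants_more_data else base_w
    return z_minutes, w
```

Calling `release_thresholds(1.5)` corresponds to the second hour of running time in the example, where z is 20 minutes and w stays at 30.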
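In this embodiment, the independent processing unit dynamically adjusts the threshold data (z, w) based on the capability of the fault diagnosis system, which accommodates fault diagnosis systems of different capabilities; for example, it allows the fault diagnosis system to obtain more fault information.
The independent processing unit obtains the threshold data (x, y, z) and stores it in its own storage area, as shown in step 202:
202. Store the threshold data (x, y, z).
The independent processing unit obtains the threshold data (x, y, z) and stores it in its own storage area. The threshold data (x, y, z) can then be read from the storage area of the independent processing unit for subsequent operations.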
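One way to picture step 202 is a small store that lives in the independent processing unit and can be updated at run time while the BIOS only reads from it. The class and method names below are hypothetical and the sketch models the storage area as an in-memory object; the patent does not describe a concrete interface.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    x_seconds: float             # max gap between continuous CE interrupts
    y_count: int                 # continuous interrupts that make a storm
    z_seconds: float             # suppression time before release
    w_releases: float = math.inf # releases before permanent suppression


class IndependentUnit:
    """Hypothetical model of the unit's storage area for threshold data."""

    def __init__(self, thresholds):
        self._thresholds = thresholds

    def update(self, **changes):
        # Thresholds can be changed at any time (user input or algorithm).
        self._thresholds = replace(self._thresholds, **changes)

    def read_xy(self):
        # What the BIOS fetches after receiving a CE interrupt (step 204).
        return self._thresholds.x_seconds, self._thresholds.y_count
```

Because the BIOS reads (x, y) on demand rather than holding fixed copies, a call such as `unit.update(y_count=5)` takes effect at the next read without touching the BIOS.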
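When a CE fault occurs in a hardware structure of the computing device, the BIOS, BMC or OS of the computing device can obtain the threshold data (x, y, z) from the independent processing unit to determine whether a CE storm has occurred, perform CE storm suppression when a CE storm is determined to have occurred, and release the suppression, as described in the following steps:
203. A CE fault occurs in the hardware structure, and a CE interrupt is sent.
When a CE fault occurs in a hardware structure of the computing device, the hardware structure sends a CE interrupt to the BIOS, so that the BIOS becomes aware of the CE fault.
In a possible implementation, the hardware structure may also send the CE interrupt to the OS or BMC to announce that a CE fault has occurred in the hardware structure. The OS or BMC supports obtaining the CE interrupt and modifying the switch of the CE interrupt.
After receiving the CE interrupt, the BIOS determines based on it whether a CE storm has occurred and takes the corresponding measures, as described in steps 204 to 208 below:
204. The BIOS obtains the threshold data (x, y).
The BIOS obtains the threshold data (x, y) from the independent processing unit.
Specifically, the BIOS can obtain the threshold data (x, y) from the storage area of the independent processing unit.
205. The BIOS determines the CE storm based on (x, y).
Specifically, when the time interval between two CE interrupts received by the BIOS is not greater than x, the two CE interrupts are determined to be continuous CE interrupts, and the number of continuous CE interrupts is recorded. When the number of continuous CE interrupts reaches y, it is determined that a CE storm has occurred. When the number of continuous CE interrupts is less than y, it is determined that no CE storm has occurred. If, before the number of continuous CE interrupts reaches y, the time interval between two CE interrupts is greater than x, the recorded number of continuous CE interrupts is cleared.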
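The determination in step 205 amounts to a run-length counter over interrupt timestamps. The sketch below is a minimal illustration with hypothetical names; it assumes timestamps in seconds and counts the interrupts in the current run (so an isolated interrupt counts as 1), a counting convention the text does not pin down.

```python
class StormDetector:
    """Hypothetical CE storm detector driven by thresholds (x, y)."""

    def __init__(self, x_seconds, y_count):
        self.x_seconds = x_seconds
        self.y_count = y_count
        self.count = 0          # interrupts in the current continuous run
        self.last_time = None   # timestamp of the previous CE interrupt

    def on_ce_interrupt(self, now):
        """Record one CE interrupt; return True if a CE storm is declared."""
        if self.last_time is not None and now - self.last_time <= self.x_seconds:
            self.count += 1     # gap not greater than x: still continuous
        else:
            self.count = 1      # gap greater than x: clear, start a new run
        self.last_time = now
        return self.count >= self.y_count


detector = StormDetector(x_seconds=60, y_count=3)
print([detector.on_ce_interrupt(t) for t in (0, 10, 20)])
```

With x = 60 and y = 3, three interrupts ten seconds apart form one continuous run, so only the third call reports a storm; a gap longer than x anywhere in the sequence would restart the count.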
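206. The BIOS performs CE storm suppression.
After the BIOS determines that a CE storm has occurred, it performs CE storm suppression.
For example, after determining that a CE storm has occurred, the BIOS turns off the switch for reporting CE interrupts, prohibiting the hardware structure of the computing device from sending CE interrupts to the BIOS and thereby preventing continuous CE interrupt reporting from affecting normal services.
When the BIOS performs CE storm suppression, it also notifies the independent processing unit that suppression has been executed, so that the independent processing unit times the suppression based on the threshold data z, as described in step 207:
207. The independent processing unit performs CE storm suppression timing.
When the BIOS performs CE storm suppression, the independent processing unit obtains the threshold data z and times the storm release. For example, the independent processing unit can count z down and, when the count reaches 0, send a CE storm suppression release task to the BIOS; or it can start a timer and, when the timed value reaches the threshold data z, send a CE storm suppression release task to the BIOS. In practice, the independent processing unit may time CE storm suppression based on z in other ways, which are not specifically limited here.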
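The countdown variant of step 207 can be sketched as follows. The class, the `tick` method and the callback are hypothetical names; the sketch assumes time advances through explicit ticks rather than a hardware timer, which keeps the example deterministic.

```python
class SuppressionTimer:
    """Hypothetical countdown kept by the independent processing unit."""

    def __init__(self, z_seconds, send_release_task):
        self.z_seconds = z_seconds
        self.send_release_task = send_release_task  # callback towards the BIOS
        self.remaining = None   # None means no suppression is being timed

    def start(self):
        # Called when the BIOS announces that suppression has begun.
        self.remaining = self.z_seconds

    def tick(self, elapsed_seconds=1):
        # Decrement z; when it reaches 0, issue the release task exactly once.
        if self.remaining is None:
            return
        self.remaining -= elapsed_seconds
        if self.remaining <= 0:
            self.remaining = None
            self.send_release_task()
```

The count-up alternative mentioned in the text differs only in direction: start from 0 and fire the callback once the elapsed value reaches z.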
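In this embodiment, the independent processing unit times the release of CE storm suppression based on the threshold data z, which effectively releases the suppression and restores CE interrupt reporting, so that the BIOS can continue to obtain fault information and the faults of the computing device are managed in real time as far as possible.
208. The BIOS releases CE storm suppression.
When the independent processing unit determines, based on the threshold data z, that CE storm suppression has lasted for z, the BIOS receives the CE storm suppression release task issued by the independent processing unit and executes the release. After the suppression is released, when a CE fault occurs, the hardware structure can again send a CE interrupt to the BIOS to inform it of the fault, so that the corresponding operations can proceed normally, such as obtaining CE fault information and having the fault diagnosis system diagnose, locate and analyze the CE fault based on that information.
In a possible implementation, when the number of times the BIOS has executed CE storm suppression release reaches the threshold data w, the BIOS performs permanent CE storm suppression, as described in step 209 below:
209. The BIOS performs permanent CE storm suppression.
Specifically, the independent processing unit counts the number of times the BIOS executes CE storm suppression release. When the count reaches the threshold data w, the permanent suppression flag w becomes valid and the independent processing unit no longer issues CE storm suppression release tasks to the BIOS, which achieves permanent CE storm suppression.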
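The counting in step 209 can be illustrated with a small gate that decides whether a release task may still be issued. The names are hypothetical, and the sketch assumes the count is incremented each time a release is granted.

```python
import math


class ReleaseGate:
    """Hypothetical counter that enforces the permanent-suppression limit w."""

    def __init__(self, w):
        self.w = w
        self.releases = 0   # number of suppression releases issued so far

    @property
    def permanent(self):
        # The permanent suppression flag is valid once the count reaches w.
        return self.releases >= self.w

    def request_release(self):
        """Return True if a release task may be issued, False if suppressed."""
        if self.permanent:
            return False
        self.releases += 1
        return True


gate = ReleaseGate(w=2)
print([gate.request_release() for _ in range(3)])
```

Setting `w=math.inf` models the case described earlier in which an important hardware structure should never be permanently suppressed.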
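In this embodiment, permanent CE storm suppression effectively prevents a hardware structure that is highly prone to CE faults from frequently reporting CE interrupts. Since CE faults are self-healing, this keeps the continual reporting of CE interrupts from affecting normal services, avoids endlessly repeating CE storm suppression and release, and reduces the workload of the computing device.
It should be noted that FIG. 2 is only an example for understanding the embodiments of the present application and does not substantively limit the solution. The operations performed by the BIOS in FIG. 2 may also be performed by the OS or BMC, provided that, as described above, the OS or BMC supports obtaining CE interrupts and modifying the switch for CE interrupt reporting; they may also be implemented in other ways, which are not specifically limited here.
In the embodiments of the present application, the threshold data obtained by the independent processing unit is used to suppress CE storms and release the suppression, and the thresholds can be changed in real time according to the application environment. The solution therefore suits many application scenarios, can modify the threshold data adaptively, is more flexible, and suits fault diagnosis systems of different working capabilities.
To aid understanding of the embodiments of the present application, two application scenario examples are described below. First refer to FIG. 3, a schematic diagram of an application scenario provided by an embodiment of the present application.
The computing device is described by taking a server with an x86 CPU as an example. Since an x86 CPU contains no processing unit that is independent of the main CPU, the BMC serves as the independent processing unit in FIG. 3: it obtains the threshold data and performs operations such as CE storm suppression timing. Because the BMC provides out-of-band management, the user can modify the threshold data through out-of-band interaction, and the BIOS reads the threshold data from the BMC to determine the CE storm and perform CE storm suppression and release. The specific implementation process is as follows:
301. The BMC obtains threshold data through a web page.
For example, the user can enter the threshold data (x, y, z and/or w) into the BMC through a web page, that is, an input interface or input program, using out-of-band interaction; the specific values of the threshold data (x, y, z and/or w) can be modified by the user in real time according to the application environment, which is not limited here. Specifically, in this embodiment, the user can adjust the threshold data according to the computing device system's tolerance for CE storms, or the BMC can adjust the threshold data dynamically according to changes in the fault diagnosis system's (that is, the BMC's or OS's) demand for CE fault information, so that the solution suits a variety of application scenarios. The implementation is similar to that described for FIG. 2 and is not repeated here.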
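302. A CE fault occurs in the hardware structure.
When the hardware structure detects that a CE fault has occurred in itself, it repairs the CE fault and executes the following step 303 to send a CE interrupt to the BIOS.
303. The hardware structure sends a CE interrupt to the BIOS.
When the hardware structure detects that a CE fault has occurred in itself, it sends a CE interrupt to the BIOS, so that the BIOS becomes aware that a CE fault has occurred.
304. The BIOS obtains threshold data from the BMC.
The BIOS reads the threshold data (x, y) from the BMC and then executes step 305 to determine the CE storm based on the threshold data.
305. The BIOS determines the CE storm.
For example, the BIOS determines the CE storm based on the threshold data (x, y) it has read. Specifically, when the reporting time interval between two adjacent CE interrupts is less than x, the two CE interrupts are determined to be continuous CE interrupts, and the continuous CE interrupts are counted. When the time interval between two adjacent CE interrupts is greater than x, the count of continuous CE interrupts is cleared; when the count of continuous CE interrupts reaches the threshold data y, it is determined that a CE storm is occurring. The BIOS then executes the following step 306 to suppress the CE storm.
306. The BIOS performs CE storm suppression.
When the BIOS determines that a CE storm is currently occurring, it performs CE storm suppression. Specifically, the BIOS turns off the switch for reporting CE interrupts and prohibits their reporting, so that continuous CE interrupts neither affect the normal operation of services nor cause services to stall or go down.
307. The BIOS reports the suppression event to the BMC.
Specifically, when the BIOS performs CE storm suppression, it also reports to the BMC that CE storm suppression has started, so that the BMC performs the following step 308:
308. The BMC performs suppression timing.
Specifically, after receiving the suppression event reported by the BIOS, the BMC obtains the threshold data z and starts the suppression timing. For example, the BMC can count z down and execute step 309 when the count reaches 0, or it can count up from 0 and execute step 309 when the timer reaches the threshold data z, which is not specifically limited here.
309. The BMC issues the CE storm suppression release task.
Specifically, when the timing satisfies the condition of the threshold data z, the BMC issues the CE storm suppression release task to the BIOS.
310. The BIOS executes CE storm suppression release.
After receiving the CE storm suppression release task issued by the BMC, the BIOS executes the CE storm suppression release. For example, the BIOS turns on the switch for reporting CE interrupts. When a CE fault occurs, the BIOS again receives the CE interrupt reported by the hardware structure and becomes aware of the CE fault, and may also obtain the CE fault information and send it to the fault diagnosis system.
In a possible implementation, the BMC obtains the threshold data w from the user, and when the number of CE storm suppression releases issued by the BMC reaches the threshold data w, the threshold data w is triggered, that is, the permanent suppression flag becomes valid. Specifically, the BIOS then implements permanent CE storm suppression: the switch for reporting CE interrupts is no longer turned on, and CE interrupts are never obtained again. This reduces the workload of the computing device and avoids frequently executing CE storm suppression and CE storm suppression release.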
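Steps 303 to 310 describe a message flow between the BIOS and the BMC, which the sketch below models with two cooperating objects. Everything here is a hypothetical simplification: the class and method names are invented for illustration, time is passed in explicitly as seconds, and a polling call stands in for the BMC's timer.

```python
class Bmc:
    """Hypothetical BMC acting as the independent processing unit of FIG. 3."""

    def __init__(self, x, y, z, w):
        self.x, self.y, self.z, self.w = x, y, z, w
        self.releases_sent = 0
        self.deadline = None    # time at which the release task is due

    def read_thresholds(self):                 # step 304
        return self.x, self.y

    def on_suppression_event(self, now):       # steps 307-308
        if self.releases_sent < self.w:
            self.deadline = now + self.z
        # Otherwise the permanent flag is valid and no timer is armed.

    def poll(self, now, bios):                 # steps 309-310
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            self.releases_sent += 1
            bios.release()


class Bios:
    """Hypothetical BIOS side: storm determination, suppression, release."""

    def __init__(self, bmc):
        self.bmc = bmc
        self.reporting_enabled = True
        self.count = 0
        self.last_time = None

    def on_ce_interrupt(self, now):            # steps 303-306
        if not self.reporting_enabled:
            return                             # switch is off: nothing arrives
        x, y = self.bmc.read_thresholds()
        if self.last_time is not None and now - self.last_time < x:
            self.count += 1
        else:
            self.count = 1
        self.last_time = now
        if self.count >= y:
            self.reporting_enabled = False     # turn off CE interrupt reporting
            self.count, self.last_time = 0, None
            self.bmc.on_suppression_event(now)

    def release(self):
        self.reporting_enabled = True          # turn reporting back on
```

With `w=1`, the first storm is suppressed and later released, while a second storm leaves reporting switched off for good, which matches the permanent suppression behaviour described above.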
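It should be noted that the BMC in FIG. 3 serves as the independent processing unit and illustrates an independent processing unit implemented in software. It is only an example for understanding the embodiments of the present application and does not substantively limit them. In practice, other software may also implement the independent processing unit, which is not specifically limited here.
FIG. 3 above illustrates the application scenario in which the software BMC serves as the independent processing unit. The following takes the application scenario in which the hardware IMU serves as the independent processing unit as an example. Refer to FIG. 4, another schematic diagram of an application scenario provided by an embodiment of the present application.
The computing device is exemplified by a server that includes a reduced instruction set machine (Acorn RISC Machine, ARM) CPU. Since an ARM CPU integrates hardware units such as an IMU or ME that are independent of the main CPU, the IMU or ME can obtain the threshold data and perform storm suppression timing, and the BIOS can obtain the threshold data from the IMU or ME, determine the CE storm, and perform CE storm suppression and release. The specific implementation process shown in FIG. 4 is as follows:
401. The IMU obtains threshold data.
For example, the user can enter the threshold data (x, y, z and/or w) into the IMU through the IMU's own serial port. The specific values of the threshold data (x, y, z and/or w) can be determined by the user according to the application environment and then modified in real time through the serial port of the IMU. The values of the threshold data in each application environment can be determined according to actual conditions, which is not specifically limited here.
In addition, the IMU can also adjust the threshold data based on the computing device's current tolerance for CE storms, for example by dynamically adjusting the threshold data based on the CPU occupancy rate as described for FIG. 2, or based on changes in the fault diagnosis system's (that is, the BMC's or OS's) demand for CE fault information. The implementation is similar to that described for FIG. 2 and is not repeated here.
In this embodiment, the user can adjust the threshold data according to the computing device system's tolerance for CE storms, or the BMC can adjust the threshold data dynamically according to changes in the fault diagnosis system's demand for CE fault information, so that the solution suits a variety of application scenarios.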
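402. A CE fault occurs in the hardware structure.
When the hardware structure detects that a CE fault has occurred in itself, it repairs the CE fault and executes the following step 403 to send a CE interrupt to the BIOS.
403. The hardware structure sends a CE interrupt to the BIOS.
404. The BIOS obtains threshold data from the IMU.
405. The BIOS determines the CE storm.
406. The BIOS performs CE storm suppression.
407. The BIOS reports the suppression event to the BMC.
Specifically, when the BIOS performs CE storm suppression, it also reports to the BMC that CE storm suppression has started, so that the following step 408 is performed:
408. The IMU performs suppression timing.
Specifically, after receiving the suppression event reported by the BIOS, the IMU starts the suppression timing based on the stored threshold data z.
409. The BMC issues the CE storm suppression release task.
410. The BIOS executes CE storm suppression release.
It should be noted that the specific implementation of steps 402 to 410 is similar to that of steps 302 to 310 in FIG. 3 and is not repeated here.
It should be noted that FIG. 3 and FIG. 4 are only examples of application scenarios for understanding the embodiments of the present application and do not substantively limit them. In practice, the operations on the BIOS side may also be implemented by the BMC or OS, and the independent processing unit may be a unit such as the BMC, IMU or ME, which is not specifically limited here.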
以上对本申请实施例所提供的硬件故障上报的处理方法进行了详细介绍,本文中应用了具体个例对本申请实施例的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请实施例的硬件故障上报的处理方法及其核心思想。同时,对于本领域的一般技术人员,依据本申请实施例的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本申请实施例的限制。
另外,在本申请各个实施例中的各功能模块或单元可以集成在一个处理器中,且上述集成的模块或单元既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
本申请实施例还提供一种计算机可读存储介质,包括计算机可读指令,当计算机可读指令在计算机上运行时,使得计算机执行如前述方法实施例所示任一项实现方式。
本申请实施例还提供的一种计算机程序产品,计算机程序产品包括计算机程序或指令,当计算机程序或指令在计算机上运行时,使得计算机执行如前述方法实施例所示任一项实现方式。
本申请实施例还提供一种芯片或芯片系统,该芯片可包括处理器。该芯片还可包括存储器(或存储模块)和/或收发器(或通信模块),或者,该芯片与存储器(或存储模块)和/或收发器(或通信模块)耦合,其中,收发器(或通信模块)可用于支持该芯片进行有线和/或无线通信,存储器(或存储模块)可用于存储程序或一组指令,该处理器调用该程序或该组指令可用于实现上述方法实施例、方法实施例的任意一种可能的实现方式中由终端或者网络设备执行的操作。该芯片系统可包括以上芯片,也可以包含上述芯片和其他分离器件,如存储器(或存储模块)和/或收发器(或通信模块)。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请实施例提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请实施例可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请实施例而言更多情 况下软件程序实现是更佳的实施方式。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、只读存储器(read only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例的方法。

Claims (13)

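  1. A method for processing hardware fault reporting, characterized by comprising:
    obtaining, by a computing device through an algorithm of an independent processing unit, at least a first threshold and a second threshold, wherein the first threshold and the second threshold are stored in the independent processing unit;
    determining, by the computing device based on the first threshold, that consecutive correctable errors (CEs) have occurred;
    accumulating, by the computing device, the number of the consecutive CEs; and
    stopping, by the computing device, CE interrupt reporting based on the number of the consecutive CEs and the second threshold, wherein the CE interrupt is used to notify that the CE has occurred.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining, by the computing device through the algorithm of the independent processing unit, a third threshold, wherein the third threshold is stored in the independent processing unit;
    after the computing device stops CE interrupt reporting based on the number of the consecutive CEs and the second threshold, the method further comprises:
    resuming, by the computing device, reporting of the CE interrupt based on the duration for which the CE interrupt reporting has been stopped and the third threshold.
  3. The method according to claim 2, characterized in that the obtaining, by the computing device through the algorithm of the independent processing unit, of at least the first threshold and the second threshold comprises:
    determining, by the computing device through the algorithm of the independent processing unit, the first threshold and the second threshold in real time based on the occupancy of a central processing unit (CPU);
    and the obtaining, by the computing device through the algorithm of the independent processing unit, of the third threshold comprises:
    determining, by the computing device through the algorithm of the independent processing unit, the third threshold based on the capability requirement of a fault diagnosis system.
  4. The method according to claims 1-3, characterized in that the method further comprises:
    obtaining, by the computing device through the algorithm of the independent processing unit, a fourth threshold, wherein the fourth threshold is stored in the independent processing unit;
    after the computing device resumes reporting of the CE interrupt based on the duration for which the CE interrupt reporting has been stopped and the third threshold, the method further comprises:
    accumulating, by the computing device, a target count, wherein the target count is the number of times reporting of the CE interrupt has been resumed after being stopped; and
    permanently prohibiting, by the computing device, reporting of the CE interrupt based on the target count and the fourth threshold.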
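  5. The method according to any one of claims 1-4, characterized in that the determining, by the computing device based on the first threshold, that consecutive CEs have occurred comprises:
    determining, by the computing device through a basic input/output system (BIOS), based on the first threshold, that the consecutive CEs have occurred;
    and the stopping, by the computing device, of CE interrupt reporting based on the number of the consecutive CEs and the second threshold comprises:
    stopping, by the computing device through the BIOS, the CE interrupt reporting based on the number of the consecutive CEs and the second threshold.
  6. The method according to any one of claims 1-4, characterized in that the determining, by the computing device based on the first threshold, that consecutive CEs have occurred comprises:
    determining, by the computing device through the baseboard management controller (BMC) or an operating system (OS), based on the first threshold, that the consecutive CEs have occurred;
    and the stopping, by the computing device, of CE interrupt reporting based on the number of the consecutive CEs and the second threshold comprises:
    stopping, by the computing device through the BMC or the OS, the CE interrupt reporting based on the number of the consecutive CEs and the second threshold.
  7. The method according to any one of claims 1-6, characterized in that the independent processing unit is any one of the following:
    an intelligent management unit (IMU), a management engine (ME), a BMC, or an OS.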
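  8. A computing device, characterized in that the computing device comprises a central processing unit (CPU) and an independent processing unit, wherein the CPU is configured to store a BIOS;
    the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, wherein the first threshold and the second threshold are stored in the independent processing unit;
    the BIOS is configured to determine, based on the first threshold, that consecutive correctable errors (CEs) have occurred;
    the BIOS is further configured to accumulate the number of the consecutive CEs; and
    the BIOS is further configured to stop CE interrupt reporting based on the number of the consecutive CEs and the second threshold, wherein the CE interrupt is used to notify that the CE has occurred.
  9. The computing device according to claim 8, characterized in that the independent processing unit is further configured to obtain a third threshold through the algorithm, wherein the third threshold is stored in the independent processing unit; and
    the independent processing unit is further configured to resume reporting of the CE interrupt based on the duration for which the CE interrupt reporting has been stopped and the third threshold.
  10. The computing device according to claim 9, characterized in that the independent processing unit is specifically configured to determine, through the algorithm, the first threshold and the second threshold in real time based on the occupancy of the CPU; and
    the independent processing unit is specifically configured to determine, through the algorithm, the third threshold based on the capability requirement of a fault diagnosis system.
  11. The computing device according to claims 8-10, characterized in that the independent processing unit is further configured to obtain a fourth threshold through the algorithm, wherein the fourth threshold is stored in the independent processing unit;
    the independent processing unit is further configured to accumulate a target count, wherein the target count is the number of times reporting of the CE interrupt has been resumed after being stopped; and
    the independent processing unit is further configured to permanently prohibit reporting of the CE interrupt based on the target count and the fourth threshold.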
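  12. A computing device, characterized in that the computing device comprises a central processing unit (CPU), an independent processing unit, and a memory chip, wherein the memory chip is configured to store a BIOS and the CPU is configured to run the BIOS;
    the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, wherein the first threshold and the second threshold are stored in the independent processing unit;
    the BIOS is configured to determine, based on the first threshold, that consecutive correctable errors (CEs) have occurred;
    the BIOS is further configured to accumulate the number of the consecutive CEs; and
    the BIOS is further configured to stop CE interrupt reporting based on the number of the consecutive CEs and the second threshold, wherein the CE interrupt is used to notify that the CE has occurred.
  13. A computing device, characterized in that the computing device comprises a central processing unit (CPU), an independent processing unit, and a BMC chip;
    the independent processing unit is configured to obtain at least a first threshold and a second threshold through an algorithm, wherein the first threshold and the second threshold are stored in the independent processing unit;
    the BMC chip is configured to determine, based on the first threshold, that consecutive correctable errors (CEs) have occurred;
    the BMC chip is further configured to accumulate the number of the consecutive CEs; and
    the BMC chip is further configured to stop CE interrupt reporting based on the number of the consecutive CEs and the second threshold, wherein the CE interrupt is used to notify that the CE has occurred.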
PCT/CN2023/104312 2022-09-28 2023-06-29 一种硬件故障上报的处理方法及其相关设备 WO2024066589A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211192364.5 2022-09-28
CN202211192364.5A CN117785521A (zh) 2022-09-28 2022-09-28 一种硬件故障上报的处理方法及其相关设备

Publications (1)

Publication Number Publication Date
WO2024066589A1 true WO2024066589A1 (zh) 2024-04-04

Family

ID=90385668

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/104312 WO2024066589A1 (zh) 2022-09-28 2023-06-29 一种硬件故障上报的处理方法及其相关设备

Country Status (2)

Country Link
CN (1) CN117785521A (zh)
WO (1) WO2024066589A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008027284A (ja) * 2006-07-24 2008-02-07 Nec Corp 障害処理システム、障害処理方法、障害処理装置およびプログラム
CN110674005A (zh) * 2019-08-30 2020-01-10 苏州浪潮智能科技有限公司 一种监控服务器内存的方法、设备及可读介质
CN111008091A (zh) * 2019-12-06 2020-04-14 苏州浪潮智能科技有限公司 一种内存ce的故障处理方法、系统及相关装置
CN111104238A (zh) * 2019-10-30 2020-05-05 苏州浪潮智能科技有限公司 一种基于ce的内存诊断的方法、设备及介质
CN112306732A (zh) * 2020-11-19 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 一种服务器中的自动纠错控制方法、装置、设备及介质
CN114911659A (zh) * 2022-05-20 2022-08-16 深信服科技股份有限公司 一种ce风暴抑制方法、装置及相关设备

Also Published As

Publication number Publication date
CN117785521A (zh) 2024-03-29

Similar Documents

Publication Publication Date Title
US11119874B2 (en) Memory fault detection
US7536584B2 (en) Fault-isolating SAS expander
JP2006259869A (ja) マルチプロセッサシステム
WO2016165304A1 (zh) 一种实例节点管理的方法及管理设备
US8667337B2 (en) Storage apparatus and method of controlling the same
JP2007109238A (ja) 回復可能なエラーのロギングのためのシステム及び方法
JP2009540436A (ja) 障害を分離するsasエクスパンダ
US9164823B2 (en) Resetting a peripheral driver and prohibiting writing into a register retaining data to be written into a peripheral on exceeding a predetermined time period
CN113176963B (zh) 一种PCIe故障自修复方法、装置、设备及可读存储介质
EP2518627A2 (en) Partial fault processing method in computer system
US20140122421A1 (en) Information processing apparatus, information processing method and computer-readable storage medium
CN114328102A (zh) 设备状态监控方法、装置、设备及计算机可读存储介质
CN112631820A (zh) 软件系统的故障恢复方法及装置
CN115617550A (zh) 处理设备、控制单元、电子设备、方法和计算机程序
US20140298076A1 (en) Processing apparatus, recording medium storing processing program, and processing method
JP2010160660A (ja) ネットワークインタフェース、計算機システム、それらの動作方法、及びプログラム
WO2024066589A1 (zh) 一种硬件故障上报的处理方法及其相关设备
JP2015197732A (ja) 情報処理装置、情報処理装置の制御方法及び情報処理装置の制御プログラム
WO2008004330A1 (fr) Système à processeurs multiples
US11704180B2 (en) Method, electronic device, and computer product for storage management
JP3447347B2 (ja) 障害検出方法
JP2007028118A (ja) ノード装置の故障判断方法
CN109062718B (zh) 一种服务器及数据处理方法
JP6133614B2 (ja) 障害ログ採取装置、障害ログ採取方法、及び、障害ログ採取プログラム
CN115202803A (zh) 一种故障处理方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869837

Country of ref document: EP

Kind code of ref document: A1