CN111045844A - Fault degradation method and device

Info

Publication number: CN111045844A
Application number: CN201911086234.1A
Authority: CN
Inventors: 刘波; 王友富
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2020-04-21

Abstract

The invention discloses a fault degradation method and device. The method comprises the following steps: continuously detecting, in an interrupt mode, whether first-class hardware of a server has a hardware fault, to obtain first-class hardware information; continuously detecting, in a polling mode, whether second-class hardware of the server has a hardware fault, to obtain second-class hardware information; continuously detecting the working state of the server to obtain working state information; and determining a fault state according to the first-class hardware information, the second-class hardware information and the working state information, and correspondingly generating a fault degradation execution strategy to execute fault degradation. When hardware fails, the invention can run the server in a degraded mode, keep the server processing important work, and avoid file damage and data loss.

Description

Fault degradation method and device

Technical Field

The present invention relates to the field of computers, and more particularly, to a method and an apparatus for fault degradation.

Background

Hardware redundancy is one of the most common and basic server technologies, and also the most widely applied. Two or more identical hardware components are provided and kept in a standby or working state at all times, so that when some components fail the remaining ones continue to work and the server keeps running without interruption for a long time. In most cases, redundancy allows the server to keep running despite a partial hardware failure, which improves server availability.

However, under certain hardware faults, downtime may result if the server system continues to run at its original performance level. Examples of such fault conditions are as follows: if the number of failed fans exceeds a certain number, the heat dissipation capacity is reduced and, in serious cases, the server overheats and crashes; if the refrigeration system of the machine room fails, the machine room temperature rises and the air-cooling capacity falls, which in serious cases also causes the server to overheat and go down; and if redundant power supplies fail, the remaining power supply cannot support the normal operating power consumption of the current system, so overcurrent protection is triggered and the server goes down.

No effective solution is currently available for the problem in the prior art that hardware failure may cause downtime.

Disclosure of Invention

In view of this, an object of the embodiments of the present invention is to provide a fault degradation method and apparatus that can run a server in a degraded mode when a hardware failure occurs, keep the server processing important work, and avoid file damage and data loss.

In view of the foregoing, a first aspect of the embodiments of the present invention provides a failure degradation method, including the following steps executed by a baseboard management controller:

continuously detecting whether hardware faults exist in first-class hardware of the server by using an interrupt mode to obtain first-class hardware information;

continuously detecting whether hardware faults exist in second-class hardware of the server by using a polling mode to obtain second-class hardware information;

continuously detecting the working state of the server to obtain working state information;

and determining a fault state according to the first class of hardware information, the second class of hardware information and the working state information, and correspondingly generating a fault degradation execution strategy to execute fault degradation.

In some embodiments, the hardware failure of the first type of hardware comprises a first type of failure that takes effect on the server immediately, the first type of hardware comprising at least one of: power overload, CPU failure, memory error.

In some embodiments, the hardware failure of the second type of hardware comprises a second type of failure that takes effect on the server after a delay, the second type of hardware comprising at least one of: ambient temperature change, board failure, hard disk failure.

In some embodiments, generating the failure degradation execution policy based on the first type of hardware information, the second type of hardware information, and the operational status information comprises:

adding an instruction to degrade at a first strength to an execution strategy in response to the first type of hardware information declaring that the first type of hardware has a first type of fault;

adding an instruction to degrade at a second strength to the execution strategy in response to the second type of hardware information declaring that the second type of hardware has a second type of fault and the working state information declaring that working environment parameters related to the second type of fault are outside a normal interval, wherein the second strength is less than or equal to the first strength;

and adding no instruction to the execution strategy in response to the second type of hardware information declaring that the second type of hardware has a second type of fault and the working state information declaring that the working environment parameters related to the second type of fault are within the normal interval.
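
The three rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_execution_strategy`, the boolean inputs, and the numeric values of `FIRST_STRENGTH` and `SECOND_STRENGTH` are hypothetical. The text only requires that the second strength be less than or equal to the first.

```python
from typing import List

# Hypothetical strength values; the source only requires SECOND <= FIRST.
FIRST_STRENGTH = 3
SECOND_STRENGTH = 1


def build_execution_strategy(first_type_fault: bool,
                             second_type_fault: bool,
                             env_within_normal_interval: bool) -> List[int]:
    """Return the strengths of the degradation instructions for one decision."""
    strategy: List[int] = []
    # Rule 1: a first-type fault always adds a first-strength instruction.
    if first_type_fault:
        strategy.append(FIRST_STRENGTH)
    # Rule 2: a second-type fault adds a second-strength instruction only
    # when the related environment parameter is outside its normal interval.
    if second_type_fault and not env_within_normal_interval:
        strategy.append(SECOND_STRENGTH)
    # Rule 3: a second-type fault with a normal environment adds nothing.
    return strategy
```

An empty list means that no degradation is requested for the current decision.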

In some embodiments, performing the failure degradation comprises:

determining the degradation strength according to the sum or the maximum value of the strengths of all the instructions in the execution strategy;

controlling the central processing unit, through the general-purpose input/output bus, to reduce its working frequency, wherein the degree of reduction of the working frequency is positively correlated with the degradation strength.
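
A minimal sketch of these two steps, assuming integer instruction strengths, is given below. The function names, the linear mapping from strength to frequency, and the `step_mhz` and `floor_mhz` parameters are hypothetical; the text only states that the reduction grows with the degradation strength and does not prescribe a formula.

```python
from typing import Iterable


def degradation_strength(strengths: Iterable[int], combine: str = "max") -> int:
    """Combine instruction strengths by sum or by maximum."""
    values = list(strengths)
    if not values:
        return 0  # empty strategy: no degradation
    if combine == "sum":
        return sum(values)
    if combine == "max":
        return max(values)
    raise ValueError("combine must be 'sum' or 'max'")


def target_frequency_mhz(nominal_mhz: int, strength: int,
                         step_mhz: int = 200, floor_mhz: int = 800) -> int:
    """Map a degradation strength to a CPU frequency (assumed linear mapping).

    The reduction is strength * step_mhz, so a larger strength never yields
    a higher frequency; floor_mhz is an assumed lower bound.
    """
    reduction = strength * step_mhz
    return max(floor_mhz, nominal_mhz - reduction)
```

Taking the maximum lets the most severe instruction decide the outcome, while the sum lets several mild faults accumulate; the source leaves this choice open.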

A second aspect of an embodiment of the present invention provides a fault degradation apparatus, including:

a processor; and

a memory storing program code executable by the processor, the program code when executed by the baseboard management controller performing the steps of:

The invention has the following beneficial technical effects. In the fault degradation method and device provided by the embodiments of the invention, whether the first type of hardware of the server has a hardware fault is continuously detected in an interrupt mode to obtain the first type of hardware information; whether the second type of hardware of the server has a hardware fault is continuously detected in a polling mode to obtain the second type of hardware information; the working state of the server is continuously detected to obtain working state information; and a fault state is determined according to the first type of hardware information, the second type of hardware information and the working state information, and a fault degradation execution strategy is correspondingly generated to execute fault degradation. This technical scheme can run the server in a degraded mode when hardware fails, keep the server processing important work, and avoid file damage and data loss.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a fault degradation method provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two different entities or two different parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In view of the above object, a first aspect of the embodiments of the present invention provides an embodiment of a failure degradation method capable of degrading an operating server when a hardware fails. Fig. 1 is a flow chart illustrating a fault degradation method provided by the present invention.

The failure degradation method, as shown in fig. 1, includes the following steps executed by a baseboard management controller:

step S101: continuously detecting whether hardware faults exist in first-class hardware of the server by using an interrupt mode to obtain first-class hardware information;

step S103: continuously detecting whether hardware faults exist in second-class hardware of the server by using a polling mode to obtain second-class hardware information;

step S105: continuously detecting the working state of the server to obtain working state information;

step S107: and determining a fault state according to the first class of hardware information, the second class of hardware information and the working state information, and correspondingly generating a fault degradation execution strategy to execute fault degradation.

In the scheme provided by the invention, the BMC (baseboard management controller) system of the server monitors hardware faults in a timely manner and, through an early warning, informs the server to run in a degraded mode. A server running in a degraded mode can run for a longer time and, even if it eventually goes down, can first finish saving important data.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.

The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.

The following further illustrates embodiments of the invention in terms of specific examples.

Preferably, detection is performed according to the type of hardware failure of the server, and the fault detection is carried out by the BMC system of the server. Faults with a serious influence on the system (first-type faults), such as power overload, CPU faults and memory errors, can affect system operation quickly, so they are detected in an interrupt mode, which lets the BMC system acquire the hardware fault and transmit the fault information within milliseconds. Faults that affect the system only after a delay (second-type faults), such as ambient temperature changes caused by a damaged fan or a damaged machine-room air conditioner, board failures and hard disk failures, can be obtained in a polling mode, and the obtained information is confirmed multiple times to prevent false alarms.
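
The difference between the two detection paths can be sketched as follows. This is an illustrative model only: the class `FaultMonitor`, its method names, and the default of three confirmations are assumptions, and the interrupt and polling sources are simulated by plain method calls rather than real BMC hardware.

```python
from typing import Dict, Set


class FaultMonitor:
    """Tracks first-type faults (interrupt) and second-type faults (polling)."""

    def __init__(self, confirmations: int = 3) -> None:
        self.confirmations = confirmations  # assumed number of re-checks
        self.first_type_faults: Set[str] = set()
        self.second_type_faults: Set[str] = set()
        self._streak: Dict[str, int] = {}

    def on_interrupt(self, source: str) -> None:
        # Interrupt path: the fault is recorded at once, with no re-check.
        self.first_type_faults.add(source)

    def on_poll(self, source: str, faulty: bool) -> None:
        # Polling path: report a fault only after several consecutive
        # faulty readings, to suppress false alarms.
        if faulty:
            self._streak[source] = self._streak.get(source, 0) + 1
            if self._streak[source] >= self.confirmations:
                self.second_type_faults.add(source)
        else:
            self._streak[source] = 0
            self.second_type_faults.discard(source)
```

A single healthy reading resets the count here; a real system might instead require several healthy readings before clearing a fault.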

Whether to execute degradation is then decided according to the specific conditions and the hardware failure information. For example, if the machine-room air conditioner is damaged and the ambient temperature exceeds the temperature required for normal operation of the server, continued operation is likely to cause overheating and downtime, so appropriate degradation is needed; if the ambient temperature is still within the normal range, degradation is not needed for the time being. As another example, if the power supply triggers an overload protection signal, the power consumption of the whole server is too high and the server needs to be degraded immediately; otherwise it is likely to power down. As a further example, with 3+1 redundant fans, heat dissipation is not affected if one fan is damaged, but degradation needs to be performed if two or more fans are damaged.
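
The three examples in this paragraph can be expressed as a single decision function. The sketch below is hypothetical: the function name, its parameters, and the 35 °C ambient limit are assumptions chosen for illustration, since the text gives no concrete threshold.

```python
def needs_degradation(power_overload: bool,
                      failed_fans: int,
                      redundant_fans: int,
                      ambient_c: float,
                      max_ambient_c: float = 35.0) -> bool:
    """Decide whether the server should be degraded.

    max_ambient_c is an assumed limit, not a value from the source.
    """
    # Power overload protection signal: degrade immediately.
    if power_overload:
        return True
    # Fans: losing more fans than the redundancy covers hurts cooling.
    if failed_fans > redundant_fans:
        return True
    # Ambient temperature above the normal operating range.
    if ambient_c > max_ambient_c:
        return True
    return False
```

With 3+1 fan redundancy, `redundant_fans` is 1, so one failed fan does not trigger degradation while two failed fans do.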

Finally, the BMC triggers the degradation operation according to the result of the degradation decision. The degradation trigger is implemented through GPIO (general-purpose input/output), and both the server hardware system and the server system software monitor changes on the GPIO. Hardware degradation mainly consists of CPU frequency reduction: the BMC is connected directly to a pin of the CPU through the GPIO to trigger the frequency reduction, which directly reduces heat generation and power consumption. After detecting the GPIO signal, the system software processing module learns that the server has raised a serious hardware fault early warning, and accordingly starts degradation processing and saves data to prevent data loss.
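
The trigger path described here can be modelled in software as one signal line with two observers. The sketch below is a simulation under stated assumptions: `GpioLine`, `Cpu` and `SystemSoftware` are hypothetical in-memory stand-ins, not a real GPIO driver or BMC API, and the frequencies are arbitrary example values.

```python
from typing import Callable, List


class GpioLine:
    """In-memory stand-in for the GPIO line driven by the BMC."""

    def __init__(self) -> None:
        self.level = 0
        self._listeners: List[Callable[[int], None]] = []

    def subscribe(self, callback: Callable[[int], None]) -> None:
        self._listeners.append(callback)

    def set(self, level: int) -> None:
        # Notify observers only when the level actually changes.
        if level != self.level:
            self.level = level
            for callback in self._listeners:
                callback(level)


class Cpu:
    """Hardware side: lowers its frequency while the line is asserted."""

    def __init__(self, nominal_mhz: int, throttled_mhz: int) -> None:
        self.nominal_mhz = nominal_mhz
        self.throttled_mhz = throttled_mhz
        self.freq_mhz = nominal_mhz

    def on_gpio(self, level: int) -> None:
        self.freq_mhz = self.throttled_mhz if level else self.nominal_mhz


class SystemSoftware:
    """Software side: starts degradation handling and saves data."""

    def __init__(self) -> None:
        self.degraded = False
        self.data_saved = False

    def on_gpio(self, level: int) -> None:
        if level:
            self.degraded = True
            self.data_saved = True  # placeholder for the real save routine


line = GpioLine()
cpu = Cpu(nominal_mhz=2400, throttled_mhz=1200)
software = SystemSoftware()
line.subscribe(cpu.on_gpio)
line.subscribe(software.on_gpio)
line.set(1)  # the BMC asserts the line after deciding to degrade
```

A single signal reaches both the hardware and the software, so the CPU can slow down even if the operating system is slow to respond.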

It can be seen from the foregoing embodiments that, in the fault degradation method provided in the embodiments of the present invention, whether the first type of hardware of the server has a hardware fault is continuously detected in an interrupt mode to obtain the first type of hardware information; whether the second type of hardware of the server has a hardware fault is continuously detected in a polling mode to obtain the second type of hardware information; the working state of the server is continuously detected to obtain working state information; and a fault state is determined according to the first type of hardware information, the second type of hardware information and the working state information, and a fault degradation execution strategy is correspondingly generated to execute fault degradation. This technical scheme can run the server in a degraded mode when hardware fails, keep the server processing important work, and avoid file damage and data loss.

It should be particularly noted that the steps in the embodiments of the fault degradation method described above can be interleaved, replaced, added, or deleted. Such reasonable permutations and combinations therefore also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.

In view of the above object, a second aspect of the embodiments of the present invention provides an embodiment of a failure degradation apparatus capable of degrading an operating server when a hardware failure occurs. The fault degradation apparatus includes:

a processor; and

It can be seen from the foregoing embodiments that, in the fault degradation apparatus provided in the embodiments of the present invention, whether the first type of hardware of the server has a hardware fault is continuously detected in an interrupt mode to obtain the first type of hardware information; whether the second type of hardware of the server has a hardware fault is continuously detected in a polling mode to obtain the second type of hardware information; the working state of the server is continuously detected to obtain working state information; and a fault state is determined according to the first type of hardware information, the second type of hardware information and the working state information, and a fault degradation execution strategy is correspondingly generated to execute fault degradation. This technical scheme can run the server in a degraded mode when hardware fails, keep the server processing important work, and avoid file damage and data loss.

It should be particularly noted that the above embodiments of the fault degradation apparatus use the embodiments of the fault degradation method to describe the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the fault degradation method. Since the steps in the embodiments of the fault degradation method can be interleaved, replaced, added, and deleted, such reasonable permutations and combinations of the fault degradation apparatus also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbering of the embodiments disclosed in the embodiments of the present invention is merely for description and does not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A method of fault degradation, comprising performing, by a baseboard management controller:

2. The method of claim 1, wherein the hardware failure of the first type of hardware comprises a first type of failure that can immediately affect a server, the first type of hardware comprising at least one of: power overload, CPU failure, memory error.

3. The method of claim 1, wherein the hardware failure of the second type of hardware comprises a second type of failure that can affect a server after a delay, the second type of hardware comprising at least one of: ambient temperature changes, board failures, hard disk failures.

4. The method of claim 1, wherein generating a fault degradation enforcement policy based on the first class of hardware information, the second class of hardware information, and the operating state information comprises:

adding an instruction to degrade at a first strength to the execution policy in response to the first type of hardware information declaring a first type of failure in the first type of hardware;

adding an instruction to degrade at a second strength to the execution policy in response to the second type of hardware information declaring that a second type of fault exists in the second type of hardware and the working state information declaring that working environment parameters related to the second type of fault are outside a normal interval, wherein the second strength is less than or equal to the first strength;

and adding no instruction to the execution policy in response to the second type of hardware information declaring that the second type of hardware has a second type of fault and the working state information declaring that the working environment parameters related to the second type of fault are within the normal interval.

5. The method of claim 4, wherein performing fault degradation comprises:

and controlling the central processing unit to reduce the working frequency through the general input and output bus, wherein the reduction degree of the working frequency is positively correlated to the degradation strength.

6. A fault degradation apparatus, comprising:

a processor; and

7. The apparatus of claim 6, wherein the hardware failure of the first type of hardware comprises a first type of failure that can immediately affect a server, the first type of hardware comprising at least one of: power overload, CPU failure, memory error.

8. The apparatus of claim 6, wherein the hardware failure of the second type of hardware comprises a second type of failure that can affect a server after a delay, and wherein the second type of hardware comprises at least one of: ambient temperature changes, board failures, hard disk failures.

9. The apparatus of claim 6, wherein generating a failure degradation enforcement policy based on the first class of hardware information, the second class of hardware information, and the operating state information comprises:

10. The apparatus of claim 9, wherein performing fault degradation comprises: