CN112905377B

CN112905377B - Method, system and medium for avoiding BMC breakdown

Info

Publication number: CN112905377B
Application number: CN202110181358.9A
Authority: CN
Inventors: 黄家安
Original assignee: Shandong Yingxin Computer Technology Co Ltd
Current assignee: Shandong Yingxin Computer Technology Co Ltd
Priority date: 2021-02-10
Filing date: 2021-02-10
Publication date: 2022-09-20
Anticipated expiration: 2041-02-10
Also published as: CN112905377A

Abstract

The invention discloses a method for avoiding a BMC (baseboard management controller) crash, which comprises the following steps: executing a monitoring strategy to monitor the CPU utilization rate and interrupt instructions; setting a flag area, a non-flag area, an interrupt parameter threshold and a flag threshold; when the CPU utilization rate reaches the flag area, if an instruction generator generates an interrupt instruction, executing an increment strategy on the interrupt parameter and the flag corresponding to the instruction generator; and when the CPU utilization rate reaches the non-flag area, executing a decrement strategy on the flag corresponding to the instruction generator. With this method, when a server instruction generator is abnormal during factory manufacturing, testers are prevented from misjudging the fault because the BMC has crashed and cannot respond.

Description

Method, system and medium for avoiding BMC breakdown

Technical Field

The present invention relates to the field of server detection technologies, and in particular, to a method, a system, and a medium for avoiding a BMC crash.

Background

When a user presses the power instruction generator to start a server, the BMC records an SEL (system event log) entry noting that the power instruction generator of the server system was pressed. Likewise, when the user presses the restart instruction generator or the interrupt instruction generator, the BMC records an SEL entry noting that the corresponding instruction generator was pressed. The BMC learns that an instruction generator on the server system has been pressed by detecting the high/low level change on a GPIO pin, and then uses the GPIO interrupt function to quickly process the event corresponding to the pressed instruction generator.

However, when an instruction generator is faulty, many interrupts are generated in a short time, and the BMC is forced to process these interrupt functions in a concentrated burst. The CPU capacity that the BMC has available for other work is thereby reduced, so upper-layer services cannot be processed promptly and effectively.

Disclosure of Invention

The invention mainly solves the problem that, when a server instruction generator is abnormal, the BMC repeatedly processes the abnormal interrupts, so that the effective processing capacity of the BMC's CPU is reduced or the BMC crashes.

In order to solve the technical problems, the invention adopts a technical scheme that: a method for avoiding BMC crash is provided, which comprises the following steps:

executing a monitoring strategy, and monitoring the CPU utilization rate and an interrupt instruction of an instruction generator;

setting a flag area, a non-flag area, an interrupt parameter threshold value and a flag threshold value;

when the CPU utilization rate reaches the flag area, if the instruction generator generates the interrupt instruction, executing an increasing strategy on the interrupt parameter and the flag corresponding to the instruction generator;

and when the CPU utilization rate reaches the non-flag area, executing a decreasing strategy on the flag corresponding to the instruction generator.

Further, the step of setting the flag area, the non-flag area, the interrupt parameter threshold value and the flag threshold value further includes: the non-flag area comprises a first threshold value interval, a second threshold value interval and a buffer interval, and the flag area comprises a first parameter interval and a second parameter interval.

Further, the increment strategy includes a first increment strategy and a second increment strategy;

when the CPU utilization rate reaches the first parameter interval, the monitoring strategy monitors whether the instruction generator generates the interrupt instruction, if so, the first increment strategy is executed on the interrupt parameter according to the interrupt instruction, when the interrupt parameter reaches the interrupt parameter threshold, the first increment strategy is executed on the flag, and when the flag is larger than the flag threshold, the emergency strategy is executed.

Further, when the CPU utilization rate reaches the second parameter interval, the monitoring strategy monitors whether the instruction generator generates the interrupt instruction, if so, the second increment strategy is executed on the interrupt parameter according to the interrupt instruction, when the interrupt parameter reaches the interrupt parameter threshold, the first increment strategy is executed on the flag, and when the flag is greater than the flag threshold, the emergency strategy is executed.

Further, the step of executing the emergency policy further comprises: and closing the interrupt processing program corresponding to the instruction generator, stopping processing the interrupt instruction of the instruction generator, storing the information of closing the interrupt processing program in a system event log, and executing alarm operation.

Further, the step of executing a decrement policy on the flag corresponding to the instruction generator when the CPU utilization reaches the non-flag region further includes: and if the CPU utilization rate reaches a first threshold interval, initializing the interrupt parameters and the flag corresponding to the instruction generator.

Further, the step of executing a decrement policy on the flag corresponding to the instruction generator when the CPU utilization reaches the non-flag region further includes:

when the CPU utilization rate reaches the second threshold interval, executing the first decreasing strategy on the flag corresponding to the instruction generator, judging whether the interrupt processing program corresponding to the instruction generator is in a closed state, if the interrupt processing program of the instruction generator is in the closed state, judging whether the flag of the instruction generator is equal to the flag threshold, if so, keeping the interrupt processing program of the instruction generator in the closed state, if not, starting the interrupt processing program of the instruction generator, and storing the information of the started interrupt processing program in the system event log.

Further, the first increment strategy is to increment the interruption parameter or the flag by taking the first parameter as a base number, and when the interruption parameter or the flag reaches the interruption parameter threshold or the flag threshold, the increment is continued from an initial value;

the second increment strategy is to increment the interruption parameter or the flag by taking the second parameter as a base number, and when the interruption parameter or the flag reaches the interruption parameter threshold or the flag threshold, the increment is continued from an initial value.
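The two increment strategies can be illustrated with a minimal sketch. It assumes the example values used later in the embodiments (a base of 1 or 2, a threshold of 10, an initial value of 0); the function name `increment_with_wrap` is hypothetical. The text does not spell out how restarting from the initial value interacts with the "flag greater than the flag threshold" test, so the sketch shows only the counting step itself.

```python
def increment_with_wrap(value, step, threshold, initial=0):
    """Add `step` to `value`; if the result reaches `threshold`,
    report that and restart counting from `initial`.

    Returns (new_value, reached_threshold).
    """
    value += step
    if value >= threshold:
        return initial, True
    return value, False


# First increment strategy: base 1. Second increment strategy: base 2.
after_first = increment_with_wrap(9, step=1, threshold=10)   # reaches 10, restarts
after_second = increment_with_wrap(3, step=2, threshold=10)  # 5, below threshold
```

Keeping the base as a parameter lets the same counting step serve both strategies.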

The invention also provides a system for avoiding the BMC crash, which is applied to a server, and comprises: the system comprises a monitoring module, a weighting module, a strategy module and a system event module;

the monitoring module is used for monitoring the CPU utilization rate of the server and an interrupt instruction of an instruction generator of the server, and setting a flag area and a non-flag area;

the strategy module stores an increasing strategy, a decreasing strategy and an emergency strategy;

when the CPU utilization rate reaches the flag area, the weighting module executes the increment strategy, executes the interrupt parameter weighting and the flag weighting according to the interrupt instruction, and executes the emergency strategy on an interrupt processing program corresponding to the instruction generator;

when the CPU utilization rate reaches a non-flag area, the weighting module executes the decrement strategy on the flag and executes the emergency strategy on an interrupt processing program corresponding to the instruction generator;

the system event module is used for storing the information of the interrupt processing program.

The invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the method steps for avoiding a BMC crash.

The invention has the beneficial effects that:

1. The method for avoiding a BMC crash prevents the following situation: when a server instruction generator is abnormal during factory manufacturing, the BMC crashes and cannot respond, and the server tester mistakenly concludes that the fault lies in the design of the server's BMC firmware. The abnormal instruction generator can instead be located through the system event log, so server research and development staff do not need to travel to the factory to confirm the cause of the abnormality.

2. The system for avoiding a BMC crash weights the interrupt instructions of the instruction generator according to the strategies in the strategy module, closes the interrupt processing program according to the thresholds once weighting is finished, reopens the interrupt processing program when the reopening condition is met, and stores the closing or opening of the interrupt processing program as a system event, which facilitates the work of maintenance personnel.

3. The medium for avoiding a BMC crash implements the increment or decrement strategy for the interrupt parameter corresponding to the instruction generator, generates a flag when the interrupt parameter reaches its threshold, judges from the flags whether the instruction generator is abnormal, and closes the interrupt processing program of an abnormal instruction generator, thereby reducing the load on the CPU and ensuring its normal operation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic diagram of a method for avoiding a BMC crash according to embodiment 1 of the invention;

FIG. 2 is a flowchart of a method for avoiding a BMC crash according to embodiment 1 of the invention;

FIG. 3 is a block diagram of a method for avoiding BMC crash according to embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of a system for avoiding a BMC crash according to embodiment 2 of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, removable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; it may be an internal communication between two elements; and it may be wireless or wired. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description of the present invention, BMC (Baseboard Management Controller) is the baseboard management controller, and CPU (Central Processing Unit) is the central processing unit.

Example 1

The embodiment of the invention provides a method for avoiding a BMC crash; referring to FIG. 1 to FIG. 3, the method includes the following steps:

A faulty server instruction generator can send interrupts to the BMC abnormally, so that the CPU load of the BMC rises sharply and the BMC crashes. To avoid this, the invention first creates a monitoring strategy that continuously monitors the CPU utilization rate of the server, and sets CPU utilization thresholds and an interrupt parameter. The CPU utilization range is divided into a flag area and a non-flag area: the flag area is a CPU utilization rate greater than or equal to 70 percent, and the non-flag area is a CPU utilization rate less than 70 percent.

When the CPU utilization rate is greater than or equal to 70 percent and less than 90 percent, the monitoring strategy determines whether an instruction generator interrupt has occurred on the server. Each instruction generator corresponds to its own interrupt response program, and an interrupt counter counts the interrupts generated by each instruction generator. The interrupt parameter is initialized to 0 at startup. Each time pressing an instruction generator produces an interrupt, 1 is added to its interrupt parameter; if no interrupt is generated, the monitoring strategy simply continues to monitor the CPU utilization rate.

When the monitoring strategy finds that the CPU utilization rate is greater than or equal to 90 percent, the CPU can no longer keep up with a dense burst of interrupts, so the interrupt parameter is weighted more heavily: the monitoring strategy confirms whether an instruction generator interrupt has occurred on the server, and if so, 2 is added to the interrupt parameter.

Adding 1 or 2 to the interrupt parameter is only a simple weighting relationship. The specific numbers are neither the point of the discussion nor a limit on the scope of protection; they describe an idealized case. More suitable weights can be selected according to the running state of the server, its application scenario, and the performance of its CPU.
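The weighting rule described above can be sketched as a small function. The name `interrupt_weight` and its parameters are hypothetical; the 70% and 90% boundaries and the weights 1 and 2 are the example values from this embodiment, exposed as defaults because the text says they may be tuned. Returning 0 below 70% is an assumption: the embodiment only states explicitly that interrupts are not counted in the buffer interval.

```python
def interrupt_weight(cpu_percent, first_weight=1, second_weight=2):
    """Return the amount added to the interrupt parameter for one interrupt."""
    if cpu_percent >= 90:
        return second_weight   # utilization >= 90%: heavier weighting
    if cpu_percent >= 70:
        return first_weight    # 70% <= utilization < 90%
    return 0                   # non-flag area: not counted (assumption)


weight_at_75 = interrupt_weight(75)
weight_at_93 = interrupt_weight(93)
```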

When the interrupt parameter exceeds its upper limit of 10, a flag is generated. When the number of flags exceeds 10, the interrupt processing program corresponding to the instruction generator that produced the interrupts (for example, the shutdown instruction generator or the restart instruction generator) is closed. In other words, the instruction generator is judged to be abnormal and the CPU refuses to accept its interrupt instructions, which protects the normal operation of the upper-layer services of the server. When the interrupt processing program is closed, information about the instruction generator is recorded in the SEL (system event log). This alerts staff that an abnormal instruction generator is affecting the normal operation of the CPU, so that they can respond quickly.
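A minimal sketch of this per-generator bookkeeping follows. The class `GeneratorState`, its fields, and the in-memory list standing in for the system event log are all hypothetical. Two details are assumptions: the sketch generates a flag when the interrupt parameter reaches the limit (the wording used in the claims), and it restarts the interrupt parameter from 0 after each flag.

```python
from dataclasses import dataclass, field

INTERRUPT_LIMIT = 10   # example upper limit of the interrupt parameter
FLAG_LIMIT = 10        # example flag threshold


@dataclass
class GeneratorState:
    name: str
    interrupt_count: int = 0
    flags: int = 0
    handler_enabled: bool = True
    sel: list = field(default_factory=list)  # stand-in for the system event log

    def on_interrupt(self, weight):
        """Account for one interrupt seen while utilization is in the flag area."""
        if not self.handler_enabled:
            return  # interrupts from a closed handler are no longer processed
        self.interrupt_count += weight
        if self.interrupt_count >= INTERRUPT_LIMIT:
            self.interrupt_count = 0      # restart counting (assumption)
            self.flags += 1
        if self.flags > FLAG_LIMIT:
            self.handler_enabled = False  # emergency strategy: close the handler
            self.sel.append(f"{self.name}: interrupt handler closed")


restart = GeneratorState("restart instruction generator")
for _ in range(10):
    restart.on_interrupt(weight=1)
# Ten weight-1 interrupts have produced one flag; the handler is still open.
```

Keeping one state object per instruction generator mirrors the statement that each generator has its own interrupt response program and is counted separately.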

When the monitoring strategy finds that the CPU utilization rate has fallen below 70%, that is, into the non-flag area, the CPU again has capacity to process some interrupt instructions, so the closed interrupt processing program corresponding to the instruction generator may be reopened, depending on the situation.

When the CPU utilization rate is below 70% and greater than or equal to 60%, it lies in the buffer interval. The monitoring strategy takes no action on the interrupt processing program and simply waits for the CPU utilization rate to rise or fall further.

When the monitoring strategy finds that the CPU utilization rate is less than or equal to 60 percent and greater than 50 percent, the flag count is reduced by one, with a minimum of 0. Because the remaining CPU capacity is sufficient to handle various interrupt instructions, interrupt handling can be attempted again, so the monitoring strategy checks the current state of the interrupt processing program of the instruction generator. If, after the reduction, the flag count of the instruction generator equals 10 (that is, it was originally 11), the instruction generator may have a serious problem that maintenance staff have not yet dealt with, and its interrupt processing program is kept closed. If the flag count has been reduced twice, the CPU utilization rate has remained stably within the non-flag area and has made at least one round trip between the buffer interval and the 50%-to-60% interval. When the monitoring strategy then finds that the flag count of the instruction generator is less than 10, it reopens the closed interrupt processing program, records the reopening in the system event log to inform maintenance staff that the instruction generator has been re-enabled, and the CPU resumes processing the interrupt instructions of the instruction generator.

When the monitoring strategy finds that the CPU utilization rate is less than or equal to 50%, an initialization strategy is triggered that clears the interrupt parameter and the flag count, because at this point the CPU can handle a large amount of service work and execute a large number of interrupt instructions.
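The five utilization intervals of this embodiment can be summarized in a short sketch. The enum `Region` and the function `classify` are hypothetical names; the interval names follow the summary of the invention (first and second parameter interval, buffer interval, first and second threshold interval). The text assigns exactly 60% both to the buffer interval and to the decrement interval; the sketch assumes it belongs to the buffer interval.

```python
from enum import Enum


class Region(Enum):
    SECOND_PARAMETER = "flag area, utilization >= 90%"
    FIRST_PARAMETER = "flag area, 70% <= utilization < 90%"
    BUFFER = "non-flag area, 60% <= utilization < 70%"
    SECOND_THRESHOLD = "non-flag area, 50% < utilization < 60%"
    FIRST_THRESHOLD = "non-flag area, utilization <= 50%"


def classify(cpu_percent):
    """Map a CPU utilization percentage to its interval."""
    if cpu_percent >= 90:
        return Region.SECOND_PARAMETER   # add 2 per interrupt
    if cpu_percent >= 70:
        return Region.FIRST_PARAMETER    # add 1 per interrupt
    if cpu_percent >= 60:
        return Region.BUFFER             # wait, take no action
    if cpu_percent > 50:
        return Region.SECOND_THRESHOLD   # decrement the flag count
    return Region.FIRST_THRESHOLD        # clear interrupt parameter and flags


region_at_65 = classify(65)
```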

Example 2

An embodiment of the present invention provides a system for avoiding a BMC crash, applied to a server; referring to FIG. 4, the system includes:

the system comprises a monitoring module, a weighting module, an instruction generator, a strategy module and a system event module.

The instruction generator issues various control instructions to the server, such as a shutdown instruction, a restart instruction, or a standby instruction. Each time the instruction generator issues an instruction, it raises an interrupt to the server BMC. When the instruction generator is faulty, it sends the instruction to the BMC repeatedly; the CPU of the BMC crashes because it receives a large number of instructions in an instant, and the BMC cannot work normally.

The monitoring module monitors the CPU utilization rate and divides it into a flag interval and a non-flag interval. In this embodiment, according to the performance index of the CPU, the flag interval is a CPU utilization rate greater than or equal to 70%, and the non-flag interval is a CPU utilization rate less than 70%. Once the CPU utilization rate reaches 70%, the monitoring module further subdivides the interval.

When the CPU utilization rate is greater than or equal to 70% and less than 90%, the monitoring module monitors whether an instruction generator produces an interrupt instruction. If it does, the weighting module and the strategy module cooperate to weight the interrupt parameter and the flag. Each instruction generator is weighted independently, whether the interrupt comes from the shutdown instruction generator or the restart instruction generator. Each time the monitoring module detects an interrupt instruction, the weighting module invokes the first increment strategy of the strategy module, which adds 1 to the interrupt parameter of that instruction generator. When the interrupt parameter has increased to 10, the first increment strategy is executed on the flag, so the flag count also increases by 1. When the flag count is greater than 10, the interrupt processing program corresponding to the instruction generator is stopped and the BMC no longer accepts interrupt instructions from that instruction generator, because such frequent interrupts indicate that the instruction generator is abnormal. The stopped interrupt processing program is recorded in the system event module; this preserves the scene so that maintenance personnel can service the abnormal instruction generator.

Similarly, a CPU utilization rate greater than or equal to 90% indicates that very little CPU capacity remains, so a faster weighting strategy is required. Each time the monitoring module detects that the shutdown instruction generator or the restart instruction generator has produced an interrupt instruction, the second increment strategy is applied to the interrupt parameter of that instruction generator, which adds 2 each time. When the interrupt parameter has increased to 10, the first increment strategy of the strategy module is executed on the flag. When the flag count is greater than 10, the interrupt processing program corresponding to the instruction generator is stopped and the BMC no longer accepts interrupt instructions from that instruction generator, because such frequent interrupts indicate that the instruction generator is abnormal. The stopped interrupt processing program is recorded in the system event module; this preserves the scene so that maintenance personnel can service the abnormal instruction generator.
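As a worked example of why the second increment strategy reacts faster, the sketch below counts how many interrupts are needed before the interrupt processing program is stopped. It assumes that utilization stays in a single interval, that all counters start at 0, and that the interrupt parameter restarts from 0 each time it reaches 10; the function name `interrupts_until_shutdown` is hypothetical.

```python
import math


def interrupts_until_shutdown(step, interrupt_limit=10, flag_limit=10):
    """Interrupts needed before the handler is stopped, under the
    assumptions stated above."""
    per_flag = math.ceil(interrupt_limit / step)  # interrupts per generated flag
    flags_needed = flag_limit + 1                 # flag count must exceed the limit
    return per_flag * flags_needed


first_interval = interrupts_until_shutdown(step=1)   # 10 * 11 = 110
second_interval = interrupts_until_shutdown(step=2)  # 5 * 11 = 55
```

Under these assumptions, a faulty instruction generator is cut off after half as many interrupts when utilization is at or above 90%.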

When the CPU utilization rate falls into the non-flag area, it is subdivided further. A CPU utilization rate less than 70% and greater than or equal to 60% is assigned to the buffer interval. In this interval it is not yet known whether the CPU utilization rate is stable, so interrupt instructions are not counted and the interrupt processing programs of the instruction generators are neither opened nor closed.

When the CPU utilization rate falls to less than 60% and more than 50%, the flag decrement strategy is applied to all interrupt processing programs once per entry into this interval; that is, the flag of each instruction generator is decremented by 1. An interrupt processing program is closed only when its flag is greater than 10, that is, 11. A closed interrupt processing program is therefore reopened in this interval only when the flag of its instruction generator is less than 10, which means that the CPU utilization rate has fallen into the 50%-to-60% interval twice and is stable. When the flag equals 10, the closed interrupt processing program is not reopened. When a closed interrupt processing program is reopened, the event is sent to the system event module for storage, informing maintenance personnel that the instruction generator has been re-enabled and needs to be checked carefully.
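The decrement and reopening rule can be traced with a short sketch. The function name `on_enter_decrement_interval` and the list standing in for the system event module are hypothetical; the threshold of 10 is the example value from this embodiment. The function is meant to be called once per entry into the 50%-to-60% interval.

```python
FLAG_LIMIT = 10  # example flag threshold


def on_enter_decrement_interval(flags, handler_enabled, log):
    """Apply the flag decrement strategy once and decide whether a closed
    interrupt handler may be reopened. Returns (flags, handler_enabled)."""
    flags = max(flags - 1, 0)               # the flag count never drops below 0
    if not handler_enabled and flags < FLAG_LIMIT:
        handler_enabled = True              # stable again: reopen the handler
        log.append("interrupt handler reopened")
    return flags, handler_enabled


log = []
flags, enabled = 11, False                  # handler was closed at flag count 11
flags, enabled = on_enter_decrement_interval(flags, enabled, log)  # 10, stays closed
flags, enabled = on_enter_decrement_interval(flags, enabled, log)  # 9, reopened
```

The handler closed at a flag count of 11 therefore needs two separate entries into the interval before it is reopened, which matches the stability argument above.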

When the CPU utilization rate is less than or equal to 50%, the interrupt parameter and the flag are initialized. Normally initialization means clearing them, but the initial values can also be set according to the performance of the CPU.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program that instructs the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, or an optical disk.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for avoiding BMC crash is characterized by comprising the following steps:

setting a flag area, a non-flag area, an interruption parameter threshold and a flag threshold, wherein the non-flag area comprises a first threshold interval, a second threshold interval and a buffer interval, and the flag area comprises a first parameter interval and a second parameter interval;

when the CPU utilization rate reaches the flag area, if the instruction generator generates the interrupt instruction, executing an increasing strategy on the interrupt parameter and the flag corresponding to the instruction generator; the increment policies include a first increment policy and a second increment policy;

when the CPU utilization rate reaches the first parameter interval, the monitoring strategy monitors whether the instruction generator generates the interrupt instruction, if so, the first incremental strategy is executed on the interrupt parameter according to the interrupt instruction, when the interrupt parameter reaches the interrupt parameter threshold, the first incremental strategy is executed on the flag, and when the flag is greater than the flag threshold, the emergency strategy is executed;

when the CPU utilization rate reaches the second parameter interval, the monitoring strategy monitors whether the instruction generator generates the interrupt instruction, if so, the second incremental strategy is executed on the interrupt parameter according to the interrupt instruction, when the interrupt parameter reaches the interrupt parameter threshold, the first incremental strategy is executed on the flag, and when the flag is greater than the flag threshold, the emergency strategy is executed;

the step of executing the emergency policy includes: closing an interrupt processing program corresponding to the instruction generator, stopping processing of the interrupt instruction of the instruction generator, storing information of closing the interrupt processing program in a system event log, and executing an alarm operation;

when the CPU utilization rate reaches the non-flag area, executing a decreasing strategy on the flag corresponding to the instruction generator; if the CPU utilization rate reaches a first threshold interval, initializing the interrupt parameters and the flag corresponding to the instruction generator; when the CPU utilization rate reaches the second threshold interval, executing the first decreasing strategy on the flag corresponding to the instruction generator, judging whether the interrupt processing program corresponding to the instruction generator is in a closed state, if the interrupt processing program of the instruction generator is in the closed state, judging whether the flag of the instruction generator is equal to the flag threshold, if so, keeping the interrupt processing program of the instruction generator in the closed state, if not, starting the interrupt processing program of the instruction generator, and storing the information of the started interrupt processing program in the system event log.

2. The method for avoiding a BMC crash of claim 1, wherein: the first increment strategy is to increment the interruption parameter or the flag by taking the first parameter as a base number, and when the interruption parameter or the flag reaches the interruption parameter threshold or the flag threshold, the increment is continued from an initial value;

3. A system for avoiding a BMC crash, applied to a server, using the method for avoiding a BMC crash of claim 1 or 2, wherein the system comprises: the system comprises a monitoring module, a weighting module, a strategy module and a system event module;

4. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the method steps of avoiding a BMC crash of claim 1 or 2.