Disclosure of Invention
The invention provides a method and a system for early warning of the reliability of a server power supply, which solve the problem that existing server power supply monitoring strategies lack accurate fault early warning, so that the stability of the server power supply is affected.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The first aspect of the invention provides a method for early warning of the power supply reliability of a server, which comprises the following steps:
Monitoring state information of the power supply, characteristic parameter information of each part of the power supply and index data of power supply operation respectively;
Comparing the monitored abnormal information with a preset abnormal value to obtain a corresponding risk level;
determining a preset coping strategy corresponding to the risk level, wherein the coping strategy comprises on-site inspection, and issuing an early warning prompt for the server power supply by combining the on-site inspection result with the risk level, wherein the on-site inspection is used for acquiring external environment information of the server.
Further, the method comprises the steps of:
for the server power supply for which the early warning prompt is issued, locating and repairing the fault point by means of a fault maintenance robot.
Further, the state information includes temperature information of the power supply, a power supply output overcurrent signal and an overvoltage signal.
Further, the monitoring of the state information specifically includes:
in response to a state information abnormality alarm reported by the baseboard management controller (BMC), calling the complex programmable logic device (CPLD) to acquire the state information of the alarm item corresponding to the current abnormality alarm, comparing it with the state information acquired by the baseboard management controller, and forming the abnormal information if the two are consistent.
Further, the power supply components whose characteristic parameter information is monitored comprise a power factor correction feedback circuit, a diode circuit, a communication optocoupler, the driving chip of each path and a standby control circuit.
Further, the monitoring of the characteristic parameter information specifically includes:
The complex programmable logic device polls the characteristic parameter information in a server power supply register, and the characteristic parameter information is collected in real time through a sensor;
comparing the characteristic parameter information with a preset value, recording the number of times an abnormality occurs, marking that number according to a preset rule, and taking the marked value as the abnormal information.
Further, the index data includes output power consumption of the power supply, output current, and voltage value of the output signal.
Further, the monitoring of the index data specifically includes:
The complex programmable logic device polls index data in a power register of the server, and the index data is acquired and/or calculated in real time through a power chip;
comparing the index data with a preset value, recording the number of times an abnormality occurs, marking that number according to a preset rule, and taking the marked value as the abnormal information.
The second aspect of the present invention provides a server power reliability early warning system, the system comprising:
The power supply online monitoring module is used for respectively monitoring state information of the power supply, characteristic parameter information of each part of the power supply and index data of power supply operation;
the reliability early warning module is used for comparing the monitored abnormal information with a preset abnormal value to obtain a corresponding risk level;
The data center machine room control module is used for determining a preset coping strategy corresponding to the risk level, wherein the coping strategy comprises on-site inspection, and for issuing an early warning prompt for the server power supply by combining the on-site inspection result with the risk level, wherein the on-site inspection is used for acquiring external environment information of the server.
Further, the system also comprises a server power supply maintenance module, wherein the server power supply maintenance module is used for locating, by means of the fault maintenance robot, the fault point of the server power supply for which the early warning prompt is issued, and for repairing the fault point according to a preset strategy.
The early warning system for the power supply reliability of the server according to the second aspect of the present invention can implement the methods in the first aspect and the implementations of the first aspect, and achieve the same effects.
The effects provided in the summary of the invention are merely effects of embodiments, not all effects of the invention, and one of the above technical solutions has the following advantages or beneficial effects:
The invention provides multi-dimensional monitoring of the server power supply, covering state information, index data and characteristic parameters. The state information is monitored by polling, and the state information monitored by the BMC is verified by the CPLD to obtain accurate power supply state information, which avoids the false alarms that occur in the existing BMC-only monitoring mode and ensures the accuracy of the early warning. The power supply for which a warning is issued is located and overhauled by the machine room robot, so that personnel need not enter the machine room, the influence of the machine room environment is avoided, and labor cost is saved.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different structures of the invention. In order to simplify the present disclosure, components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and processes are omitted so as to not unnecessarily obscure the present invention.
The embodiment of the invention provides a pre-warning method for the reliability of a server power supply, which comprises the following steps:
S1, respectively monitoring state information of a power supply, characteristic parameter information of each part of the power supply and index data of power supply operation;
S2, comparing the monitored abnormal information with a preset abnormal value to obtain a corresponding risk level;
S3, determining a preset coping strategy corresponding to the risk level, wherein the coping strategy comprises on-site inspection, and issuing an early warning prompt for the server power supply by combining the on-site inspection result with the risk level, wherein the on-site inspection is used for acquiring external environment information of the server.
In one implementation manner of the embodiment of the present invention, the method further includes the steps of:
for the server power supply for which the early warning prompt is issued, locating and repairing the fault point by means of a fault maintenance robot.
In step S1, the status information includes temperature information of the power supply, a power supply output overcurrent signal and an overvoltage signal.
The monitoring of the state information specifically comprises the following. The server BMC polls the state information of the server power supply and compares it with the specification value. If the state information of the power supply does not meet the specification value, the BMC displays a power supply alarm state. Based on the power supply alarm state, the CPLD reads the parameter information of the alarming server power supply and compares it with the alarm information fed back by the BMC. If the two differ, a command is issued to the BMC to read again and feed back, until the comparison results are the same. If the parameter information read by the CPLD matches the fault alarm information fed back by the BMC, the fault alarm information is determined to be abnormal information.
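By way of non-limiting illustration, the cross-check between the BMC alarm and the CPLD reading described above might be sketched as follows. The function name `confirm_alarm`, the two reader callbacks and the `max_rereads` bound are assumptions introduced only for this sketch; the embodiment itself states only that re-reading continues until the results agree.

```python
from typing import Callable, Dict, Optional

Reading = Dict[str, int]


def confirm_alarm(
    read_bmc_alarm: Callable[[], Reading],
    read_cpld_state: Callable[[], Reading],
    max_rereads: int = 5,  # assumed bound; the source re-reads until agreement
) -> Optional[Reading]:
    """Return the alarm reading once BMC and CPLD agree, otherwise None."""
    for _ in range(max_rereads + 1):
        bmc_alarm = read_bmc_alarm()    # alarm item as reported by the BMC
        cpld_state = read_cpld_state()  # same item as read through the CPLD
        if bmc_alarm == cpld_state:
            # Consistent readings: treat the alarm as abnormal information.
            return bmc_alarm
        # Inconsistent readings: loop again so the BMC re-reads and feeds back.
    return None
```

A caller would pass hardware-specific readers for the two callbacks; a `None` result means that agreement was not reached within the assumed bound.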
The CPLD polls, in real time, the characteristic parameter data of each part of the power supply, which is collected in real time by the power supply sensors and stored in the register of the server power supply. The characteristic parameters include the real-time voltage values and working states of the PFC feedback circuit, the PFC OVP detection loop and the like, and the working temperatures and states of the key circuit diodes, the communication optocoupler, the isolation driving ICs, the standby control chip and the standby integrated chip. The CPLD records the parameter data of each key part of the power supply, analyzes it and compares it with the specification interval. If the polled data is within the standard specification interval, the power state output is 00; if the polled data exceeds the standard specification interval, the power abnormal state output is 01; and if the polled data exceeds the standard specification interval 3 times in succession, the power state output is 10.
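As a minimal sketch of the 00 / 01 / 10 marking rule described above, the following tracks how many consecutive polls fall outside the specification interval. The class name `ParameterMonitor` and the example limits are hypothetical; only the three output codes and the threshold of 3 consecutive excursions come from the embodiment.

```python
from dataclasses import dataclass, field

IN_SPEC = 0b00           # polled value within the specification interval
OUT_OF_SPEC = 0b01       # polled value outside the interval
PERSISTENT_FAULT = 0b10  # outside the interval 3 times in succession


@dataclass
class ParameterMonitor:
    low: float
    high: float
    consecutive_limit: int = 3
    streak: int = field(default=0, init=False)

    def update(self, value: float) -> int:
        """Record one polled value and return the power state output."""
        if self.low <= value <= self.high:
            self.streak = 0  # an in-range poll resets the consecutive count
            return IN_SPEC
        self.streak += 1
        if self.streak >= self.consecutive_limit:
            return PERSISTENT_FAULT
        return OUT_OF_SPEC
```

One monitor instance would be kept per polled parameter, so that the consecutive count of one component does not affect another.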
The CPLD monitors, in real time, index data of the server power supply such as the output power consumption, the output current and the output signal voltages (12V, vingood, alert, PG). The index data is obtained from the power supply register: the power supply chip acquires and/or calculates the index data in real time and stores it in the power supply register, and the CPLD polls and records each item of parameter data and compares it with the specification value. To ensure the power supply redundancy of the whole machine, the output power consumption of each server power supply in the machine room must be less than 50% of the rated power of the power supply (standard specification range). To ensure the power supply stability of the whole machine, the output currents of the 2 power supplies of a server in the machine room must meet the current sharing requirement (standard specification range: at a load of less than 20%, the current sharing imbalance is less than 10%; at a load of more than 20%, the current sharing imbalance is less than 5%). To ensure the power supply and communication reliability of the server power supply, the output signals 12V (standard specification range: 12.0 V to 12.8 V), vingood (standard specification range: 2.4 V to 3.46 V), alert (standard specification range: 2.4 V to 3.46 V) and PG (standard specification range: 2.4 V to 3.46 V) must be within their specification ranges. If the polled data is within the standard specification interval, the power state output is 00; if the polled data exceeds the standard specification interval, the power abnormal state output is 01; and if the polled data exceeds the standard specification interval 3 times in succession, the power state output is 10.
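The specification checks listed above can be illustrated with the following sketch. The numeric limits are taken from the embodiment; the function names, the definition of current sharing imbalance as |I1 - I2| / (I1 + I2), the expression of load as a fraction of rated load, and the treatment of a load of exactly 20% under the stricter 5% limit are assumptions made only for this example.

```python
from typing import Dict, List

# Specification intervals (volts) for the monitored output signals.
VOLTAGE_SPECS = {
    "12V": (12.0, 12.8),
    "vingood": (2.4, 3.46),
    "alert": (2.4, 3.46),
    "PG": (2.4, 3.46),
}


def power_within_spec(output_w: float, rated_w: float) -> bool:
    """Output power must stay below 50% of rated power (redundancy)."""
    return output_w < 0.5 * rated_w


def sharing_imbalance(i1: float, i2: float) -> float:
    """Assumed definition: |I1 - I2| / (I1 + I2) for the two power supplies."""
    total = i1 + i2
    return 0.0 if total == 0 else abs(i1 - i2) / total


def sharing_within_spec(i1: float, i2: float, load_fraction: float) -> bool:
    """Below 20% load the limit is 10%; otherwise 5% (boundary is assumed)."""
    limit = 0.10 if load_fraction < 0.20 else 0.05
    return sharing_imbalance(i1, i2) < limit


def voltages_out_of_spec(readings: Dict[str, float]) -> List[str]:
    """Return the names of signals whose voltage lies outside its interval."""
    return [
        name
        for name, volts in readings.items()
        if not (VOLTAGE_SPECS[name][0] <= volts <= VOLTAGE_SPECS[name][1])
    ]
```

Each failed check would then be fed to the 00 / 01 / 10 marking rule in the same way as the characteristic parameters.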
And the server power supply state online monitoring module collects the power supply alarm information and the power supply state output value and transmits the collected information to the server power supply reliability early warning module.
In step S2, the alarm information and the power state output value of each server power supply, transmitted by the server power supply state online monitoring module, are received and summarized, and a summary table of the power supply reliability states of the servers in the machine room is generated. Based on the information in this summary table, the server power supplies are divided into a low risk area (power state output value is 00), a medium risk area (power state output value is greater than 00 and less than 10) and a high risk area (power state output value is greater than or equal to 10).
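A minimal sketch of the summary table and the division into risk areas is given below. The identifiers `build_status_table`, `risk_zone` and `group_by_zone` are hypothetical, and keeping the worst state output reported for each power supply is an assumption; the embodiment specifies only the thresholds 00, below 10, and 10 or above (binary codes).

```python
from typing import Dict, Iterable, List, Tuple


def risk_zone(state_output: int) -> str:
    """Map a 2-bit power state output to a risk area."""
    if state_output == 0b00:
        return "low"
    if state_output < 0b10:
        return "medium"
    return "high"


def build_status_table(reports: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Summarize (power_supply_id, state_output) reports into one table.

    Assumption: the worst (largest) state output per power supply is kept.
    """
    table: Dict[str, int] = {}
    for psu_id, state in reports:
        table[psu_id] = max(state, table.get(psu_id, 0b00))
    return table


def group_by_zone(table: Dict[str, int]) -> Dict[str, List[str]]:
    """Split the summary table into low, medium and high risk lists."""
    zones: Dict[str, List[str]] = {"low": [], "medium": [], "high": []}
    for psu_id, state in sorted(table.items()):
        zones[risk_zone(state)].append(psu_id)
    return zones
```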
For power supplies in the medium risk area, a medium-risk power supply list and the corresponding alarm information are generated and transmitted to the data center machine room control module for power supply reliability identification and analysis. For power supplies in the high risk area, the server power supply reliability early warning module transmits a high-risk power supply list and the corresponding alarm information to the data center machine room control module for power supply overhaul flow analysis.
In step S3, for a medium-risk server power supply, the machine room manager determines whether the risk alarm information of each power supply is a valid alarm that requires management and control, and finally evaluates the early warning level of the server power supply. If the final early warning level of the server power supply is medium risk and on-site inspection is required, the exact position of the faulty server component is located, and a server power supply index acquisition command is issued to the machine room automatic maintenance robot through Internet of Things wireless transmission. The machine room automatic maintenance robot moves to the position of the faulty server according to the fault location, photographs the working state of the faulty power supply and collects indicators such as video and odor, and transmits the collected data to the data center machine room control module. The machine room manager views and processes the feedback information through the visual interface of the control module.
For a high-risk server power supply, the machine room manager determines whether the risk alarm information of each power supply is a valid alarm that requires management and control, and finally evaluates the early warning level of the server power supply. If the final early warning level of the server power supply is high risk and on-site replacement of the power supply is required, the machine room manager sends a power supply replacement on-site confirmation instruction to the server power supply automatic maintenance module. The server power supply automatic maintenance module receives the request, locates the exact position of the faulty server component, and sends a server power supply index acquisition instruction to the machine room automatic maintenance robot. The robot moves to the position of the faulty server according to the fault location, photographs the working state of the faulty power supply and collects indicators such as video and odor, and transmits the collected data to the data center machine room control module. The machine room manager checks the feedback information, finally confirms the maintenance requirement and issues a formal maintenance instruction. The machine room automatic maintenance robot then moves to the position of the faulty power supply and, through unplugging and plugging of the power line and the power supply by its mechanical arm, completes the automatic replacement and re-powering of the faulty power supply.
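The medium-risk and high-risk flows of step S3 can be summarized, purely as an illustrative sketch, by a function that lists the actions to be taken. The function name `plan_actions`, its boolean parameters and the step labels are hypothetical; the manager's judgments are modelled as inputs rather than automated decisions.

```python
from typing import List


def plan_actions(
    risk_level: str,           # "low", "medium" or "high"
    alarm_valid: bool,         # manager judged the alarm to be a valid alarm
    onsite_needed: bool,       # on-site inspection or replacement is required
    repair_confirmed: bool = False,  # manager confirmed after reviewing data
) -> List[str]:
    """Return the ordered steps for one server power supply."""
    if risk_level == "low" or not alarm_valid or not onsite_needed:
        return []
    steps = [
        "locate_faulty_component",
        "send_index_acquisition_command_to_robot",
        "robot_moves_to_faulty_server",
        "collect_photo_video_odor",
        "transmit_data_to_control_module",
        "manager_review",
    ]
    if risk_level == "high" and repair_confirmed:
        # Formal maintenance is issued only after the manager confirms.
        steps += [
            "issue_formal_maintenance_instruction",
            "robot_replaces_power_supply",
            "power_on_again",
        ]
    return steps
```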
As shown in fig. 2, the embodiment of the invention also provides a pre-warning system for the power supply reliability of the server, which comprises a power supply on-line monitoring module 1, a reliability pre-warning module 2, a data center machine room control module 3 and a server power supply maintenance module 4.
The power supply online monitoring module 1 is used for respectively monitoring state information of the power supply, characteristic parameter information of each part of the power supply and index data of power supply operation. The reliability early warning module 2 is used for comparing the monitored abnormal information with a preset abnormal value to obtain a corresponding risk level. The data center machine room control module 3 determines a preset coping strategy corresponding to the risk level, wherein the coping strategy comprises on-site inspection, and issues an early warning prompt for the server power supply by combining the on-site inspection result with the risk level, the on-site inspection being used for acquiring external environment information of the server. The server power supply maintenance module 4 is used for locating, by means of the fault maintenance robot, the fault point of the server power supply for which the early warning prompt is issued, and for repairing the fault point according to a preset strategy.
First, the server BMC polls the server power supply state information and compares it with the specification value. If the power supply state information does not meet the specification value, the BMC displays the power supply alarm state and transmits 10 to the power supply online monitoring module. On receiving the abnormal alarm information fed back by the BMC, the power supply online monitoring module responds immediately, reads the parameter information of the alarming server power supply through the CPLD, and compares it with the alarm information fed back by the BMC. If the two differ, a command is issued to the BMC to read again and feed back, until the comparison results are the same. If the parameter information matches the fault alarm information fed back by the BMC, the power supply online monitoring module determines that the fault alarm information is correct and transmits the fault alarm information and the power state output value to the reliability early warning module.
Second, the CPLD polls, in real time, the characteristic parameter data of each part of the power supply, which is collected in real time by the power supply sensors and stored in the register of the server power supply. The characteristic parameters include the real-time voltage values and working states of the PFC feedback circuit, the PFC OVP detection loop and the like, and the working temperatures and states of the key circuit diodes, the communication optocoupler, the isolation driving ICs, the standby control chip and the standby integrated chip. The CPLD records the parameter data of each key part of the power supply and transmits it to the power supply online monitoring module, which analyzes the key parameter data of each part of the power supply and compares it with the specification interval. If the polled data is within the standard specification interval, the power state output is 00; if the polled data exceeds the standard specification interval, the power abnormal state output is 01; and if the polled data exceeds the standard specification interval 3 times in succession, the power state output is 10. The power supply online monitoring module collects the power supply alarm information and the power state output values and transmits the collected information to the reliability early warning module.
Third, the CPLD monitors, in real time, index data of the server power supply such as the output power consumption, the output current and the output signal voltages (12V, vingood, alert, PG). The index data is obtained from the power supply register, in which the power supply chip stores the values it acquires and/or calculates in real time; the CPLD polls and records each item of parameter data and compares it with the specification value. To ensure the power supply redundancy of the whole machine, the output power consumption of each server power supply in the machine room must be less than 50% of the rated power of the power supply (standard specification range). To ensure the power supply stability of the whole machine, the output currents of the 2 power supplies of a server in the machine room must meet the current sharing requirement (standard specification range: at a load of less than 20%, the current sharing imbalance is less than 10%; at a load of more than 20%, the current sharing imbalance is less than 5%). To ensure the power supply and communication reliability of the server power supply, the output signals 12V (standard specification range: 12.0 V to 12.8 V), vingood (standard specification range: 2.4 V to 3.46 V), alert (standard specification range: 2.4 V to 3.46 V) and PG (standard specification range: 2.4 V to 3.46 V) must be within their specification ranges. If the polled data is within the standard specification interval, the power state output is 00; if the polled data exceeds the standard specification interval, the power abnormal state output is 01; and if the polled data exceeds the standard specification interval 3 times in succession, the power state output is 10. The power supply online monitoring module collects the power supply alarm information and the power state output values and transmits the collected information to the reliability early warning module.
The reliability early warning module receives and summarizes the alarm information and the power state output value of each server power supply transmitted by the power supply online monitoring module, generates a summary table of the power supply reliability states of the servers in the machine room, and, based on the information in this summary table, divides the server power supplies into a low risk area (power state output value is 00), a medium risk area (power state output value is greater than 00 and less than 10) and a high risk area (power state output value is greater than or equal to 10).
For power supplies in the low risk area, the reliability early warning module transmits a low-risk power supply list to the data center machine room control module for display. For power supplies in the medium risk area, the reliability early warning module transmits the medium-risk power supply list and the corresponding alarm information to the data center machine room control module for power supply reliability identification and analysis. For power supplies in the high risk area, the reliability early warning module transmits the high-risk power supply list and the corresponding alarm information to the data center machine room control module for power supply overhaul flow analysis.
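The hand-off from the reliability early warning module to the data center machine room control module might be sketched as follows. The message layout, the `PURPOSE` mapping and the function name `route_lists` are assumptions for illustration; the embodiment states only which list is sent for which purpose and that alarm information accompanies the medium-risk and high-risk lists.

```python
from typing import Dict, List

# Purpose of each list as handled by the control module.
PURPOSE = {
    "low": "display",
    "medium": "power supply reliability identification and analysis",
    "high": "power supply overhaul flow analysis",
}


def route_lists(
    zone_lists: Dict[str, List[str]],
    alarms: Dict[str, List[str]],
) -> List[dict]:
    """Build one message per non-empty risk area for the control module."""
    messages = []
    for zone in ("low", "medium", "high"):
        supplies = zone_lists.get(zone, [])
        if not supplies:
            continue
        message = {
            "zone": zone,
            "purpose": PURPOSE[zone],
            "power_supplies": list(supplies),
        }
        if zone != "low":
            # Alarm information is attached only for medium and high risk.
            message["alarms"] = {p: alarms.get(p, []) for p in supplies}
        messages.append(message)
    return messages
```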
The data center machine room control module receives the server power supply risk list and the corresponding alarm information fed back by the reliability early warning module, and machine room management staff checks the machine room server power supply risk list and the corresponding alarm information through a visual interface of the data center machine room control module.
For a medium-risk server power supply, the machine room manager determines whether the risk alarm information of each power supply is a valid alarm that requires management and control, and finally evaluates the early warning level of the server power supply. If the final early warning level of the server power supply is medium risk and on-site inspection is required, the machine room manager issues an inspection instruction to the inspection module. The inspection module receives the request, locates the exact position of the faulty server component, and issues a server power supply index acquisition command to the machine room automatic inspection robot through Internet of Things wireless transmission. The machine room automatic inspection robot moves to the position of the faulty server according to the fault location, photographs the working state of the faulty power supply and collects indicators such as video and odor, and transmits the collected data to the data center machine room control module. The machine room manager views and processes the feedback information through the visual interface of the control module.
For a high-risk server power supply, the machine room manager determines whether the risk alarm information of each power supply is a valid alarm that requires management and control, and finally evaluates the early warning level of the server power supply. If the final early warning level of the server power supply is high risk and on-site replacement of the power supply is required, the machine room manager sends a power supply replacement on-site confirmation instruction to the overhaul module. The overhaul module receives the request, locates the exact position of the faulty server component, and sends a server power supply index acquisition command to the machine room automatic overhaul robot. The robot moves to the position of the faulty server according to the fault location, photographs the working state of the faulty power supply and collects indicators such as video and odor, and transmits the collected data to the data center machine room control module. The machine room manager checks the feedback information, finally confirms the maintenance requirement and issues a formal maintenance instruction. The machine room automatic overhaul robot then moves to the position of the faulty power supply and, through unplugging and plugging of the power line and the power supply by its mechanical arm, completes the automatic replacement and re-powering of the faulty power supply.
The scheme can also realize automatic monitoring, early warning and maintenance of the power supply reliability of the server in the machine room.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.