WO2021068814A1 - Hardware device abnormality monitoring method, apparatus, server, and computer-readable storage medium - Google Patents

Hardware device abnormality monitoring method, apparatus, server, and computer-readable storage medium

Info

Publication number
WO2021068814A1
Authority
WO
WIPO (PCT)
Prior art keywords
hardware device
indicator
abnormality
hardware
indicators
Prior art date
Application number
PCT/CN2020/119081
Other languages
English (en)
French (fr)
Inventor
何明烨
龙凯
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021068814A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored

Definitions

  • This application relates to the technical field of hardware monitoring, and in particular to a method, device, server, and computer-readable storage medium for monitoring abnormalities of hardware equipment.
  • The inventor realizes that the above method not only wastes manpower but is also inefficient and lags behind the actual fault. If the operation and maintenance personnel have no reliable troubleshooting method once a problem is found, they depend entirely on contacting after-sales personnel to solve it, and failing to repair the problem in time also affects the work of the hardware equipment and causes greater losses. In addition, as the number of devices in the network grows, it is no longer feasible for operation and maintenance personnel to enter the computer room to manage each machine, so effective remote control and management becomes more and more important.
  • This application proposes a method for monitoring abnormalities of hardware equipment, which includes the steps:
  • using the SaltStack management tool to uniformly set the indicators that the hardware devices need to monitor and the corresponding thresholds;
  • when an indicator is abnormal, issuing an early warning notification in a preset manner.
  • This application provides a hardware equipment abnormality monitoring device, which includes:
  • a collection module, used to collect the index data of the hardware device in a preset manner through the Intelligent Platform Management Interface;
  • an acquisition module, used to obtain the thresholds set for the indicators;
  • a judgment module, used to compare the collected index data with the corresponding thresholds to judge whether there is an abnormality;
  • a notification module, used to issue an early warning notification in a preset manner when an indicator is abnormal.
  • The present application also provides a server, including a memory and a processor. The memory stores a hardware device abnormality monitoring program that can run on the processor, and the following steps are implemented when the program is executed by the processor:
  • using the SaltStack management tool to uniformly set the indicators that the hardware devices need to monitor and the corresponding thresholds;
  • when an indicator is abnormal, issuing an early warning notification in a preset manner.
  • The present application also provides a computer-readable storage medium that stores a hardware device abnormality monitoring program. The program can be executed by at least one processor, so that the at least one processor performs the following steps:
  • using the SaltStack management tool to uniformly set the indicators that the hardware devices need to monitor and the corresponding thresholds;
  • when an indicator is abnormal, issuing an early warning notification in a preset manner.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server of this application.
  • FIG. 2 is a schematic diagram of the program modules of a first embodiment of a hardware device abnormality monitoring apparatus according to the present application.
  • FIG. 3 is a schematic diagram of the program modules of a second embodiment of a hardware device abnormality monitoring apparatus according to the present application.
  • FIG. 4 is a schematic diagram of the program modules of a third embodiment of a hardware device abnormality monitoring apparatus according to the present application.
  • FIG. 5 is a schematic flowchart of a first embodiment of a method for monitoring an abnormality of a hardware device according to the present application.
  • FIG. 6 is a schematic flowchart of a second embodiment of a method for monitoring an abnormality of a hardware device according to the present application.
  • FIG. 7 is a schematic flowchart of a third embodiment of a method for monitoring an abnormality of a hardware device according to the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server 2 of the present application.
  • the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that can communicate with each other through a system bus. It should be pointed out that FIG. 1 only shows the server 2 with components 11-13, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • the server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server 2 may be an independent server or a server cluster composed of multiple servers.
  • the memory 11 includes at least one type of readable storage medium. The readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 11 may be an internal storage unit of the server 2, for example, a hard disk or a memory of the server 2.
  • the memory 11 may also be an external storage device of the server 2, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. equipped on the server 2.
  • the memory 11 may also include both the internal storage unit of the server 2 and its external storage device.
  • the memory 11 is generally used to store an operating system and various application software installed on the server 2, for example, the program code of the hardware equipment abnormality monitoring apparatus 200, and so on.
  • the memory 11 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 12 is generally used to control the overall operation of the server 2.
  • the processor 12 is used to run the program code or process data stored in the memory 11, for example, to run the program code of the hardware equipment abnormality monitoring apparatus 200.
  • the network interface 13 may include a wireless network interface or a wired network interface, and the network interface 13 is usually used to establish a communication connection between the server 2 and other electronic devices.
  • this application proposes a hardware device abnormality monitoring device 200.
  • FIG. 2 is a program module diagram of the first embodiment of a hardware equipment abnormality monitoring apparatus 200 of the present application.
  • the hardware device abnormality monitoring apparatus 200 includes a series of computer program instructions stored on the memory 11, and when the computer program instructions are executed by the processor 12, the hardware device abnormality monitoring operations of each embodiment of the present application can be realized.
  • the hardware device abnormality monitoring apparatus 200 may be divided into one or more modules based on the specific operations implemented by each part of the computer program instructions. For example, in FIG. 2, the hardware device abnormality monitoring apparatus 200 can be divided into a setting module 201, a collection module 202, an acquisition module 203, a judgment module 204, and a notification module 205. Specifically:
  • the setting module 201 is used to set the indicators that the hardware device needs to monitor and the thresholds corresponding to each indicator.
  • In this embodiment, the Intelligent Platform Management Interface (IPMI) is used in combination with the SaltStack management tool to achieve unified batch management and abnormality monitoring of hardware devices.
  • IPMI is an open standard hardware management interface specification that defines a specific method for embedded management subsystems to communicate. It is an industrial standard used to manage peripheral devices used in enterprise systems based on Intel architecture.
  • the SaltStack management tool allows administrators to create a consistent management system for multiple operating systems.
  • the three major functions of SaltStack include remote execution, configuration management, and cloud management.
  • SaltStack works on a master/minion (master/slave) topology.
  • Specific commands can be executed through SaltStack on one or more minions.
  • SaltStack allows administrators to use "grains".
  • Grains allow remote queries to be run on the SaltStack minions, collecting the state information of the minions and allowing the administrator to store the information in a central location.
  • SaltStack can also help administrators define the desired state of the target system. These states are applied through .sls files, which contain very specific requirements on how to reach the required state on the system.
  • the indicators may include status information such as power supply, temperature, voltage, fan, battery, processor, memory, hard disk, and logs.
  • each indicator contains multiple refined indicators.
  • the hard disk status information includes the chip version, status, cache status data, RAID level status of the RAID card, etc.
  • power status information includes voltage, power consumption, power running status, power failure, power supply quantity, etc.
  • temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, room temperature, etc.
  • the indicators include two types of numerical data indicators and non-numerical data indicators.
  • For numerical data indicators, a corresponding threshold needs to be set in advance, and whether there is an abnormality is judged by comparison with the threshold; for non-numerical data indicators, an alarm is raised directly when a fault (inability to work) is detected. For example, when a power supply fault is detected, an alarm is raised directly, whereas temperature and voltage are numerical data with alarm thresholds, which are set according to the standards and requirements of the computer room.
  • A corresponding topology structure (such as a tree structure) can be configured for the hardware devices that need to be monitored and the indicators that each hardware device needs to monitor.
  • the threshold set for each indicator is also stored at the corresponding position of the topology structure, that is, at the node position of the corresponding hardware device and indicator in the topology structure.
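  • As an illustration only, the topology structure described above could be sketched in Python as a small tree in which each device node holds indicator nodes and each indicator node holds its threshold. The class names, the host name "server-01", and the 85.0 bound are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (low, high) alarm bounds; either side may be None if unused.
Bounds = Tuple[Optional[float], Optional[float]]


@dataclass
class IndicatorNode:
    """Leaf node: one monitored indicator of one hardware device."""
    name: str
    # None means a non-numerical indicator that alarms directly on a fault.
    threshold: Optional[Bounds] = None


@dataclass
class DeviceNode:
    """Inner node: one hardware device and the indicators it must report."""
    host: str
    indicators: Dict[str, IndicatorNode] = field(default_factory=dict)

    def add_indicator(self, name: str, threshold: Optional[Bounds] = None) -> None:
        self.indicators[name] = IndicatorNode(name, threshold)


@dataclass
class MonitoringTree:
    """Root node: every hardware device under monitoring."""
    devices: Dict[str, DeviceNode] = field(default_factory=dict)

    def add_device(self, host: str) -> DeviceNode:
        return self.devices.setdefault(host, DeviceNode(host))

    def threshold_for(self, host: str, indicator: str) -> Optional[Bounds]:
        """Look up the threshold stored at the (device, indicator) node."""
        return self.devices[host].indicators[indicator].threshold


tree = MonitoringTree()
server = tree.add_device("server-01")                    # hypothetical host
server.add_indicator("cpu_temperature", (None, 85.0))    # illustrative bound
server.add_indicator("power_supply")                     # non-numerical
```

  • Storing the threshold on the indicator node keeps the lookup keyed by device and indicator, which matches the node position described in the text.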
  • the collection module 202 is configured to collect various index data of the hardware device in a preset manner.
  • IPMI information is communicated through the Baseboard Management Controller (BMC) located on hardware components that conform to the IPMI specification.
  • Using low-level hardware intelligent management instead of operating system management has two main advantages: first, this configuration allows out-of-band server management; second, the operating system does not have to bear the task of transmitting system state data. Users can use the IPMI interface to monitor the physical health characteristics of hardware devices, such as temperature, voltage, fan working status, power supply status, etc. This standard applies to different server topologies, as well as Windows, Linux, Solaris, Mac or mixed operating systems. In addition, since the BMC operates independently of the server's main processor and operating system, IPMI can still operate normally even if the server itself does not operate normally or cannot provide services for any reason. Therefore, through the IPMI protocol interface, it is possible to collect and monitor the index data of the hardware device.
  • a corresponding tree structure is configured for the multiple hardware devices that need to be monitored and the indicators that each hardware device needs to monitor.
  • the collection module 202 traverses each node in the tree structure and sends the corresponding IPMI command to each hardware device to collect the data of the corresponding indicator.
  • the SaltStack technology can be used for remote management, and the various indicator data can be collected remotely in combination with the IPMI interface.
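  • The application does not say how the collected IPMI data is turned into indicator values. A minimal sketch, assuming the BMC readings arrive as pipe-delimited text of the kind printed by command-line tools such as ipmitool, might look as follows. The function name and the sample text are hypothetical, and the step that actually queries the BMC is omitted.

```python
from typing import Dict, Union

Reading = Union[float, str]


def parse_sensor_output(text: str) -> Dict[str, Reading]:
    """Turn 'name | value | unit | status' lines into {name: reading}.

    Numerical readings become floats; anything else (for example a
    discrete power supply state) is kept as its status string.
    """
    readings: Dict[str, Reading] = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or malformed lines
        name, value, status = fields[0], fields[1], fields[3]
        try:
            readings[name] = float(value)
        except ValueError:
            readings[name] = status
    return readings


# Made-up sample text, for illustration only.
SAMPLE = """\
CPU Temp         | 41.000     | degrees C  | ok
FAN1             | 5400.000   | RPM        | ok
PS1 Status       | 0x1        | discrete   | 0x0100
"""

readings = parse_sensor_output(SAMPLE)
```

  • Keeping the parser separate from the transport means the same function can be used whether the text is gathered locally or remotely through SaltStack.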
  • the acquisition module 203 is used to obtain the thresholds set for the indicators.
  • the thresholds set for the indicators are obtained respectively.
  • For a numerical data indicator, a corresponding alarm threshold is preset and needs to be obtained; for a non-numerical data indicator, no threshold needs to be obtained, and the alarm is triggered directly when a fault is detected.
  • the judging module 204 is used to compare the collected index data with corresponding thresholds to judge whether there is an abnormality.
  • the various index data of the hardware device collected through the IPMI interface are respectively compared with the obtained corresponding thresholds.
  • For a numerical data indicator, according to the standards and requirements of the computer room, the indicator is judged to be abnormal when the collected data exceeds or falls below the corresponding threshold (which direction applies is selected according to actual needs).
  • For a non-numerical data indicator, the indicator is directly judged to be abnormal when the collected data indicates that a fault is found.
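  • The two judgment rules above can be sketched as a single function. This is only an illustration: the function name, the (low, high) bound convention, and the set of fault strings are assumptions, not part of the application.

```python
from typing import FrozenSet, Optional, Tuple, Union

Bounds = Tuple[Optional[float], Optional[float]]  # (low, high)

# Hypothetical status strings that count as a detected fault.
DEFAULT_FAULT_STATES: FrozenSet[str] = frozenset({"fault", "failed"})


def is_abnormal(
    value: Union[float, str],
    bounds: Optional[Bounds] = None,
    fault_states: FrozenSet[str] = DEFAULT_FAULT_STATES,
) -> bool:
    """Judge one collected reading.

    Numerical readings are compared with their threshold bounds;
    non-numerical readings are abnormal as soon as they report a fault.
    """
    if isinstance(value, (int, float)):
        if bounds is None:
            return False  # no threshold configured for this indicator
        low, high = bounds
        if low is not None and value < low:
            return True
        if high is not None and value > high:
            return True
        return False
    return str(value).lower() in fault_states
```

  • For example, `is_abnormal(90.0, (None, 85.0))` flags a temperature above an upper bound, while `is_abnormal("fault")` flags a non-numerical indicator without consulting any threshold.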
  • the notification module 205 is configured to issue an early warning notification in a preset manner when an indicator is abnormal.
  • an early warning notification can be issued through a variety of preset methods.
  • the alarm display of the various indicators can be performed in the form of a page, so as to realize a visual early warning.
  • the hardware device abnormality monitoring device can use the IPMI interface combined with the SaltStack management tool to realize unified batch management and abnormality monitoring of hardware devices: it uniformly sets the indicators that the hardware devices need to monitor and the thresholds corresponding to each indicator, collects the indicator data of the hardware device through the IPMI interface, compares the collected indicator data with the corresponding thresholds to judge whether there is an abnormality, and gives a visual warning through the page when an indicator is abnormal.
  • the system can customize the hardware devices that need to be monitored and corresponding indicators and thresholds based on specific business scenarios, and perform customized analysis and detect alarms based on the configuration of various indicator data collected through the IPMI interface. For indicators that determine abnormalities, visual management, configuration, and query are achieved through the monitoring display platform, which facilitates the discovery and processing of monitoring, improves the timeliness of finding abnormalities, and the efficiency of handling abnormalities.
  • the hardware device abnormality monitoring device 200 includes, in addition to the setting module 201, the collection module 202, the acquisition module 203, the judgment module 204, and the notification module 205 of the first embodiment, a recording module 206 and a backtracking module 207.
  • the recording module 206 is used to record the processing feedback information of the abnormality.
  • the processing feedback information may include the reason for the abnormality, the processing time, the processing process, the processing result, the processing person, and the like.
  • the backtracking module 207 is used to save the collected index data and the processing feedback information, so that backtracking operations can be performed.
  • the collected index data is stored in the database, so all of the data can be queried.
  • Once the processing feedback information is recorded, it can also be traced back.
  • Storing the collected index data and the corresponding processing feedback information in the database provides the basis for subsequent backtracking services.
  • When users need to backtrack, they can query the recorded index data and the corresponding processing feedback information for further processing such as statistical analysis and hardware optimization, which helps to improve equipment monitoring.
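  • The application only states that the data is stored in "the database". As a sketch under that assumption, the recording and backtracking steps could be implemented with Python's built-in sqlite3 module. The table layout, column names, and sample values below are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two tables used for index data and processing feedback."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS index_data (
            host TEXT, indicator TEXT, value TEXT, collected_at TEXT);
        CREATE TABLE IF NOT EXISTS feedback (
            host TEXT, indicator TEXT, cause TEXT, process TEXT,
            result TEXT, handler TEXT, minutes REAL, handled_at TEXT);
        """
    )
    return conn


def save_reading(conn, host, indicator, value, collected_at):
    """Persist one collected reading so it can be queried later."""
    conn.execute(
        "INSERT INTO index_data VALUES (?, ?, ?, ?)",
        (host, indicator, str(value), collected_at),
    )
    conn.commit()


def record_feedback(conn, **row):
    """Persist the cause, process, result, handler and time of one repair."""
    conn.execute(
        "INSERT INTO feedback VALUES (:host, :indicator, :cause, :process, "
        ":result, :handler, :minutes, :handled_at)",
        row,
    )
    conn.commit()


def backtrack(conn, indicator):
    """Return the feedback history of one indicator, oldest first."""
    cur = conn.execute(
        "SELECT host, cause, process, result, minutes FROM feedback "
        "WHERE indicator = ? ORDER BY handled_at",
        (indicator,),
    )
    return cur.fetchall()


# Illustrative data only.
conn = open_store()
save_reading(conn, "server-01", "motherboard_temperature", 92.0,
             "2020-01-01T09:30:00")
record_feedback(
    conn, host="server-01", indicator="motherboard_temperature",
    cause="blocked air intake", process="cleaned filter", result="success",
    handler="operator A", minutes=30, handled_at="2020-01-01T10:00:00",
)
history = backtrack(conn, "motherboard_temperature")
```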
  • the hardware device abnormality monitoring device can apply the IPMI interface combined with the SaltStack technology to realize remote, batch management of and data collection from the underlying hardware devices.
  • it records the processing feedback information of the abnormality, saves the collected index data and processing feedback information, and provides backtracking services, which facilitates the subsequent retrieval of the data and feedback information for statistical analysis, optimization, and so on.
  • visual management, configuration, and query are achieved through the monitoring display platform, which facilitates the discovery, processing and backtracking of monitoring, and improves the timeliness of finding abnormalities and the efficiency of handling abnormalities.
  • the hardware device abnormality monitoring device 200 includes the setting module 201, the collection module 202, the acquisition module 203, the judgment module 204, the notification module 205, the recording module 206, and the backtracking module 207 of the second embodiment, and also includes a screening module 208 and a prompting module 209.
  • the screening module 208 is configured to screen out a preferred treatment plan for the abnormality according to the processing feedback information in the history record when an indicator is abnormal.
  • The processing feedback information of each abnormality is recorded and stored in the database for query. Therefore, when it is determined that an indicator is abnormal, the historical records corresponding to the abnormality can be queried from the database.
  • For example, when the motherboard temperature is abnormally high, the processing feedback information of every past occurrence of excessive motherboard temperature can be queried. Then, according to the processing process and processing result in each historical record found, a preferred processing solution for the abnormality is screened out.
  • the preferred processing solution may be the processing solution in the historical records that has a successful processing result and the shortest processing time.
  • the screening module 208 can also directly look up the preferred processing solution corresponding to the abnormality in a preset mapping table (which can be provided by the equipment supplier) that relates abnormal problems, abnormality causes, and preferred processing solutions.
  • the screening module 208 can also query other computer rooms for processing feedback information (not limited to local historical records) of the abnormality through the network or big data, and filter out the preferred processing solution.
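  • A minimal sketch of the screening rule described above (successful result, shortest processing time, with the preset mapping table as a fallback) is given below. The record fields, function names, and sample data are hypothetical and are not taken from the application.

```python
from typing import Dict, List, Optional


def preferred_treatment(history: List[Dict]) -> Optional[Dict]:
    """Pick the successful record with the shortest processing time."""
    successes = [r for r in history if r["result"] == "success"]
    if not successes:
        return None
    return min(successes, key=lambda r: r["minutes"])


def screen(indicator: str, history: List[Dict],
           mapping: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the preferred processing solution for an abnormal indicator.

    Historical records are consulted first; if none succeeded, fall back
    to a preset mapping table (for example one from the equipment supplier).
    """
    relevant = [r for r in history if r["indicator"] == indicator]
    best = preferred_treatment(relevant)
    if best is not None:
        return best["process"]
    return (mapping or {}).get(indicator)


# Illustrative records only.
HISTORY = [
    {"indicator": "motherboard_temperature", "process": "replace fan",
     "result": "success", "minutes": 90},
    {"indicator": "motherboard_temperature", "process": "clean filter",
     "result": "success", "minutes": 30},
    {"indicator": "motherboard_temperature", "process": "reboot",
     "result": "failed", "minutes": 5},
]
SUPPLIER_TABLE = {"power_supply": "swap redundant power module"}

choice = screen("motherboard_temperature", HISTORY, SUPPLIER_TABLE)
fallback = screen("power_supply", HISTORY, SUPPLIER_TABLE)
```

  • The failed "reboot" record is ignored even though it was the fastest, because only records with a successful result are eligible.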
  • the prompting module 209 is used to prompt the user of the preferred processing solution, so that the user can refer to and handle the abnormality.
  • After the screening module 208 has screened out the preferred processing solution corresponding to the abnormality, the prompting module 209 prompts the user with the preferred processing solution in a preset manner (for example, displayed in the form of a page).
  • the content of the preferred treatment solution includes the abnormal cause and processing method corresponding to the abnormality.
  • the user can learn the preferred treatment plan for the abnormality according to the prompt, so as to handle the abnormality by himself, without contacting and waiting for after-sales personnel to handle it.
  • the hardware device abnormality monitoring device provided in this embodiment can provide a reliable, preferred processing solution for each abnormality found, based on the processing feedback information in the historical records. This improves the efficiency and accuracy of abnormality repair, saves time and manpower, and reduces the maintenance cost of hardware equipment in the computer room.
  • this application also proposes a method for monitoring abnormalities of hardware devices.
  • FIG. 5 is a schematic flowchart of a first embodiment of a method for monitoring an abnormality of a hardware device according to the present application.
  • the execution order of the steps in the flowchart shown in FIG. 5 can be changed, and some steps can be omitted.
  • the method includes the following steps:
  • Step S400: set the indicators to be monitored by the hardware device and the thresholds corresponding to each indicator.
  • IPMI is used in combination with the SaltStack management tool to implement unified batch management and abnormal monitoring of hardware devices.
  • IPMI is an open standard hardware management interface specification that defines a specific method for embedded management subsystems to communicate. It is an industrial standard used to manage peripheral devices used in enterprise systems based on Intel architecture.
  • the SaltStack management tool allows administrators to create a consistent management system for multiple operating systems.
  • the three major functions of SaltStack include remote execution, configuration management, and cloud management.
  • SaltStack works on a master/minion (master/slave) topology.
  • Specific commands can be executed through SaltStack on one or more minions.
  • SaltStack allows administrators to use "grains".
  • Grains allow remote queries to be run on the SaltStack minions, collecting the state information of the minions and allowing the administrator to store the information in a central location.
  • SaltStack can also help administrators define the desired state of the target system. These states are applied through .sls files, which contain very specific requirements on how to reach the required state on the system.
  • the indicators may include status information such as power supply, temperature, voltage, fan, battery, processor, memory, hard disk, and logs.
  • each indicator contains multiple refined indicators.
  • the hard disk status information includes the chip version, status, cache status data, RAID level status of the RAID card, etc.
  • power status information includes voltage, power consumption, power running status, power failure, power supply quantity, etc.
  • temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, room temperature, etc.
  • the indicators include two types of numerical data indicators and non-numerical data indicators.
  • For numerical data indicators, a corresponding threshold needs to be set in advance, and whether there is an abnormality is judged by comparison with the threshold; for non-numerical data indicators, an alarm is raised directly when a fault (inability to work) is detected. For example, when a power supply fault is detected, an alarm is raised directly, whereas temperature and voltage are numerical data with alarm thresholds, which are set according to the standards and requirements of the computer room.
  • A corresponding topology structure (such as a tree structure) can be configured for the hardware devices that need to be monitored and the indicators that each hardware device needs to monitor.
  • the threshold set for each indicator is also stored at the corresponding position of the topology structure, that is, at the node position of the corresponding hardware device and indicator in the topology structure.
  • Step S402: collect the index data of the hardware device in a preset manner.
  • IPMI information is communicated through the BMC located on hardware components that conform to the IPMI specification.
  • Using low-level hardware intelligent management instead of operating system management has two main advantages: first, this configuration allows out-of-band server management; second, the operating system does not have to bear the task of transmitting system state data. Users can use the IPMI interface to monitor the physical health characteristics of hardware devices, such as temperature, voltage, fan working status, power supply status, etc. This standard applies to different server topologies, as well as Windows, Linux, Solaris, Mac or mixed operating systems. In addition, since the BMC operates independently of the server's main processor and operating system, IPMI can still operate normally even if the server itself does not operate normally or cannot provide services for any reason. Therefore, through the IPMI protocol interface, it is possible to collect and monitor the index data of the hardware device.
  • A corresponding tree structure is configured for the hardware devices that need to be monitored and the indicators that each hardware device needs to monitor. Each node is traversed according to the tree structure, and the corresponding IPMI command is sent to each hardware device to collect the data of the corresponding indicator.
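  • The traversal described above can be sketched as follows, assuming a two-level tree (device, then indicator) held in a plain dictionary. The transport is passed in as a function, so the `fake_send` stub below stands in for whatever actually issues the IPMI command; all names and values are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Union

Reading = Optional[Union[float, str]]
Tree = Dict[str, List[str]]  # host -> indicators to collect


def collect_all(tree: Tree,
                send_command: Callable[[str, str], Reading]
                ) -> Dict[str, Dict[str, Reading]]:
    """Visit every (device, indicator) node and collect its reading."""
    results: Dict[str, Dict[str, Reading]] = {}
    for host, indicators in tree.items():
        results[host] = {}
        for indicator in indicators:
            # One command per indicator node, sent to the owning device.
            results[host][indicator] = send_command(host, indicator)
    return results


def fake_send(host: str, indicator: str) -> Reading:
    """Stub transport returning canned readings instead of querying a BMC."""
    canned = {
        ("server-01", "cpu_temperature"): 41.0,
        ("server-01", "power_supply"): "ok",
    }
    return canned.get((host, indicator))  # None when nothing is reported


TREE: Tree = {
    "server-01": ["cpu_temperature", "power_supply"],
    "server-02": ["cpu_temperature"],
}
results = collect_all(TREE, fake_send)
```

  • Injecting `send_command` keeps the traversal independent of how the command reaches the device, whether directly or through a remote execution layer such as SaltStack.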
  • the SaltStack technology can be used for remote management, and the various indicator data can be collected remotely in combination with the IPMI interface.
  • Step S404: obtain the thresholds set for the indicators.
  • the thresholds set for the indicators are obtained respectively.
  • For a numerical data indicator, a corresponding alarm threshold is preset and needs to be obtained; for a non-numerical data indicator, no threshold needs to be obtained, and the alarm is triggered directly when a fault is detected.
  • Step S406: compare the collected index data with the corresponding thresholds, and judge whether there is an abnormality.
  • the index data of the hardware device collected through the IPMI interface are respectively compared with the obtained corresponding thresholds.
  • For a numerical data indicator, according to the standards and requirements of the computer room, the indicator is judged to be abnormal when the collected data exceeds or falls below the corresponding threshold (which direction applies is selected according to actual needs).
  • For a non-numerical data indicator, the indicator is directly judged to be abnormal when the collected data indicates that a fault is found.
  • Step S408: when an indicator is abnormal, issue an early warning notification in a preset manner.
  • an early warning notification can be issued through a variety of preset methods.
  • the alarm display of the various indicators can be performed in the form of a page, so as to realize a visual early warning.
  • the hardware device abnormality monitoring method can use the IPMI interface combined with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices: it uniformly sets the indicators that the hardware devices need to monitor and the thresholds corresponding to each indicator, collects the indicator data of the hardware device through the IPMI interface, compares the collected indicator data with the corresponding thresholds to judge whether there is an abnormality, and gives a visual warning through the page when an indicator is abnormal.
  • the method can customize the hardware devices that need to be monitored and corresponding indicators and thresholds based on specific business scenarios, and perform customized analysis and detect alarms according to the configuration for each indicator data collected through the IPMI interface. For indicators that determine abnormalities, visual management, configuration, and query are achieved through the monitoring display platform, which facilitates the discovery and processing of monitoring, improves the timeliness of finding abnormalities, and the efficiency of handling abnormalities.
  • steps S500-S508 of the hardware device abnormality monitoring method are similar to steps S400-S408 of the first embodiment, except that the method further includes steps S510-S512.
  • the method includes the following steps:
  • Step S500: set the indicators to be monitored by the hardware device and the thresholds corresponding to each indicator.
  • IPMI is used in combination with the SaltStack management tool to implement unified batch management and abnormal monitoring of hardware devices.
  • IPMI is an open standard hardware management interface specification that defines a specific method for embedded management subsystems to communicate. It is an industrial standard used to manage peripheral devices used in enterprise systems based on Intel architecture.
  • the SaltStack management tool allows administrators to create a consistent management system for multiple operating systems.
  • the three major functions of SaltStack include remote execution, configuration management, and cloud management.
  • SaltStack works on a master/minion (master/slave) topology.
  • Specific commands can be executed through SaltStack on one or more minions.
  • SaltStack allows administrators to use "grains".
  • Grains allow remote queries to be run on the SaltStack minions, collecting the state information of the minions and allowing the administrator to store the information in a central location.
  • SaltStack can also help administrators define the desired state of the target system. These states are applied through .sls files, which contain very specific requirements on how to reach the required state on the system.
  • the indicators may include status information such as power supply, temperature, voltage, fan, battery, processor, memory, hard disk, and logs.
  • each indicator contains multiple refined indicators.
  • the hard disk status information includes the chip version, status, cache status data, RAID level status of the RAID card, etc.
  • power status information includes voltage, power consumption, power running status, power failure, power supply quantity, etc.
  • temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, room temperature, etc.
  • the indicators include two types of numerical data indicators and non-numerical data indicators.
  • numerical data indicators a corresponding threshold value needs to be set in advance, and after comparing with the threshold value, it is judged whether there is an abnormality; for non-numerical data indicators, a fault is detected ( Can not work) directly alarm. For example, when the power supply is monitored, it will directly alarm; while the temperature and voltage are numerical data, and there are alarm thresholds, and the thresholds are set according to the actual situation according to the standards and requirements of the computer room.
  • a corresponding topology structure such as a tree structure
  • the threshold set for each index is also stored in the corresponding position of the topology structure, that is, the hardware device corresponding to the threshold and the node position of the index in the topology structure.
  • Step S502: collect the indicator data of the hardware devices in a preset manner.
  • IPMI information is exchanged through the BMC located on hardware components that conform to the IPMI specification.
  • Managing through low-level hardware intelligence rather than through the operating system has two main advantages: first, this configuration allows out-of-band server management; second, the operating system is relieved of the task of transmitting system status data. Users can use the IPMI interface to monitor the physical health of hardware devices, such as temperature, voltage, fan working status, and power supply status. The standard applies to different server topologies and to Windows, Linux, Solaris, Mac, or mixed operating systems. In addition, because IPMI can operate under different attribute values, it keeps working even if the server itself is malfunctioning or cannot provide service for any reason. The IPMI protocol interface can therefore be used to collect and monitor the indicator data of the hardware devices.
  • A corresponding tree structure is configured for the hardware devices to be monitored and the indicators to be monitored on each device; each node of the tree is traversed, and the corresponding IPMI command is sent to each hardware device to collect the data for the corresponding indicator.
  • Remote devices can be managed remotely through SaltStack, and the indicator data can be collected remotely in combination with the IPMI interface.
  • Step S504: obtain the thresholds set for the indicators.
  • Once the indicator data has been collected, the thresholds set for the respective indicators are obtained.
  • A numerical data indicator has a preset alarm threshold, which must be obtained; a non-numerical data indicator needs no threshold, and an alarm is raised directly when a fault is detected.
  • Step S506: compare the collected indicator data with the corresponding thresholds and judge whether there is an abnormality.
  • The indicator data of the hardware devices collected through the IPMI interface is compared with the corresponding thresholds obtained.
  • For a numerical data indicator, the indicator is judged abnormal when the collected value is above or below the corresponding threshold (which of the two applies is chosen according to actual needs), in line with the standards and requirements of the computer room.
  • For a non-numerical data indicator, the indicator is judged abnormal directly when the collected data shows that a fault has been found.
  • Step S508: when an indicator is abnormal, issue an early-warning notification in a preset manner.
  • An early-warning notification can be issued in any of several preset ways.
  • The alarms for the indicators can be displayed on a page, giving a visual early warning.
  • Step S510: record the processing feedback information for the abnormality.
  • The processing feedback information may include the cause of the abnormality, the processing time, the processing procedure, the processing result, the handler, and so on.
  • Step S512: save the collected indicator data and the processing feedback information so that trace-back operations can be performed.
  • The collected indicator data is stored in a database, so all of it can be queried.
  • Once the processing feedback information has been recorded, it can also be traced back.
  • Storing the collected indicator data and the corresponding processing feedback information in the database provides the basis for subsequent trace-back services.
  • When users need to trace back, they can query every recorded item of indicator data and the corresponding processing feedback information for further processing such as statistical analysis and hardware optimization, which helps improve device monitoring.
  • The hardware device abnormality monitoring method provided in this embodiment can combine the IPMI interface with SaltStack to manage the underlying IPMI hardware devices, and collect data from them, remotely and in batches.
  • In addition, the method records the processing feedback information for the abnormality, saves the collected indicator data and processing feedback information, and provides a trace-back service, so that the data and feedback information can later be queried for statistical analysis, optimization, and the like.
  • For indicators judged abnormal, the monitoring display platform provides visual management, configuration, and query, which makes abnormalities easier to discover, handle, and trace back and improves both the timeliness of finding them and the efficiency of handling them.
  • Steps S600-S612 of the hardware device abnormality monitoring method are similar to steps S500-S512 of the second embodiment (not repeated here); the difference is that the method further includes steps S614-S616. Specifically:
  • Step S614: based on the processing feedback information in the historical records, select a preferred processing solution for the abnormality.
  • Each time an abnormality of an indicator is found and handled, the corresponding processing feedback information is recorded and stored in the database for query. Therefore, when an indicator is judged abnormal, every historical record corresponding to that abnormality can be queried from the database.
  • For example, when the motherboard temperature is abnormally high, the processing feedback information from every earlier occurrence of excessive motherboard temperature can be queried. A preferred processing solution for the abnormality is then selected according to the processing procedure and processing result in each historical record found.
  • The preferred processing solution may be the solution in the historical records whose processing result was successful and whose processing time was shortest.
  • The processing feedback information of other computer rooms for the same abnormality (not limited to local historical records) can also be queried through the network or big data, and the preferred processing solution selected from it.
  • Step S616: prompt the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
  • The preferred processing solution is presented to the user in a preset manner (for example, displayed on a page).
  • The content of the preferred processing solution includes the cause of the abnormality and the corresponding handling method.
  • From the prompt, the user learns the preferred processing solution for the abnormality and can handle it directly, without contacting and waiting for after-sales personnel.
  • The hardware device abnormality monitoring method provided in this embodiment can use the processing feedback information in the historical records to provide a reliable preferred processing solution for each abnormality found, which improves the efficiency and accuracy of repair, saves time and manpower, and reduces the maintenance cost of the hardware devices in the computer room.
  • The computer-readable storage medium may be non-volatile or volatile; it stores a hardware device abnormality monitoring program that can be executed by at least one processor, so that the at least one processor executes the steps of the above hardware device abnormality monitoring method.
  • The technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

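A hardware monitoring technique relating to a hardware device abnormality monitoring method. The method includes: uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds; collecting the indicator data of the hardware devices in a preset manner through the IPMI interface; obtaining the thresholds set for the indicators; comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and, when an indicator is abnormal, issuing an early-warning notification in a preset manner. A server and a computer-readable storage medium are also provided. The hardware device abnormality monitoring method, server, and computer-readable storage medium can improve the timeliness of finding abnormalities and the efficiency of handling them.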

Description

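Hardware device abnormality monitoring method, apparatus, server, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910967009.2, filed with the China Patent Office on October 11, 2019 and entitled "Hardware device abnormality monitoring method, server, and computer-readable storage medium", the entire contents of which are incorporated into this application by reference.
Technical Field
This application relates to the technical field of hardware monitoring, and in particular to a hardware device abnormality monitoring method, apparatus, server, and computer-readable storage medium.
Background
With the continuous progress of network technology and the acceleration of informatization, the types and number of devices in computer rooms keep growing, and the corresponding operation and maintenance workload grows with them. How to find hardware faults quickly and accurately has become an urgent problem in operation and maintenance. At present, abnormalities in hardware devices are mainly detected by manual inspection, or problems are found and handled only after a machine has already become abnormal.
Technical Problem
The inventors realized that the above approach wastes manpower, is inefficient, and lags behind the actual fault. If operation and maintenance personnel have no reliable troubleshooting method after a problem is found and rely entirely on contacting after-sales personnel, the problem cannot be repaired in time, which affects the work of the hardware devices and causes considerable losses. In addition, as the number of devices on the network grows, operation and maintenance personnel can no longer walk into the computer room to manage each machine, so effective remote control and management becomes increasingly important.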
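Technical Solution
This application proposes a hardware device abnormality monitoring method, the method comprising the steps of:
uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
collecting, through the Intelligent Platform Management Interface (IPMI), the indicator data of the hardware devices in a preset manner;
obtaining the thresholds set for the indicators;
comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
when an indicator is abnormal, issuing an early-warning notification in a preset manner.
This application provides a hardware device abnormality monitoring apparatus, the apparatus comprising:
a setting module, configured to uniformly set, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
a collection module, configured to collect, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner;
an obtaining module, configured to obtain the thresholds set for the indicators;
a judgment module, configured to compare the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
a notification module, configured to issue an early-warning notification in a preset manner when an indicator is abnormal.
This application also provides a server comprising a memory and a processor, the memory storing a hardware device abnormality monitoring program that can run on the processor, the program implementing the following steps when executed by the processor:
uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
collecting, through the Intelligent Platform Management Interface (IPMI), the indicator data of the hardware devices in a preset manner;
obtaining the thresholds set for the indicators;
comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
when an indicator is abnormal, issuing an early-warning notification in a preset manner.
This application also provides a computer-readable storage medium storing a hardware device abnormality monitoring program that can be executed by at least one processor, so that the at least one processor performs the following steps:
uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
collecting, through the Intelligent Platform Management Interface (IPMI), the indicator data of the hardware devices in a preset manner;
obtaining the thresholds set for the indicators;
comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
when an indicator is abnormal, issuing an early-warning notification in a preset manner.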
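Brief Description of the Drawings
Figure 1 is a schematic diagram of an optional hardware architecture of the server of this application;
Figure 2 is a schematic diagram of the program modules of the first embodiment of the hardware device abnormality monitoring apparatus of this application;
Figure 3 is a schematic diagram of the program modules of the second embodiment of the hardware device abnormality monitoring apparatus of this application;
Figure 4 is a schematic diagram of the program modules of the third embodiment of the hardware device abnormality monitoring apparatus of this application;
Figure 5 is a schematic flowchart of the first embodiment of the hardware device abnormality monitoring method of this application;
Figure 6 is a schematic flowchart of the second embodiment of the hardware device abnormality monitoring method of this application;
Figure 7 is a schematic flowchart of the third embodiment of the hardware device abnormality monitoring method of this application;
The realization of the objectives, the functional features, and the advantages of this application are further described with reference to the embodiments and the accompanying drawings.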
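Embodiments of the Invention
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain this application and do not limit it. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; where a combination is contradictory or cannot be implemented, it should be regarded as nonexistent and outside the scope of protection claimed by this application.
Refer to Figure 1, a schematic diagram of an optional hardware architecture of the server 2 of this application.
In this embodiment, the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus. Note that Figure 1 shows only the server 2 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and it may be a standalone server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 2, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the server 2. The memory 11 may of course include both an internal storage unit of the server 2 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the application software installed on the server 2, such as the program code of the hardware device abnormality monitoring apparatus 200. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 2. In this embodiment, the processor 12 is used to run the program code or process the data stored in the memory 11, for example to run the program code of the hardware device abnormality monitoring apparatus 200.
The network interface 13 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the server 2 and other electronic devices.
The hardware structure and functions of the devices relevant to this application have now been described in detail. The embodiments of this application are presented below on that basis.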
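First, this application proposes a hardware device abnormality monitoring apparatus 200.
Refer to Figure 2, a program module diagram of the first embodiment of the hardware device abnormality monitoring apparatus 200 of this application.
In this embodiment, the hardware device abnormality monitoring apparatus 200 comprises a series of computer program instructions stored in the memory 11; when these instructions are executed by the processor 12, the hardware device abnormality monitoring operations of the embodiments of this application are carried out. In some embodiments, the apparatus 200 may be divided into one or more modules according to the specific operations implemented by each part of the computer program instructions. For example, in Figure 2, the apparatus 200 may be divided into a setting module 201, a collection module 202, an obtaining module 203, a judgment module 204, and a notification module 205, where:
The setting module 201 is configured to set the indicators to be monitored on the hardware devices and the threshold corresponding to each indicator.
In this embodiment, the Intelligent Platform Management Interface (IPMI) is combined with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices. IPMI is an open-standard hardware management interface specification that defines specific ways for embedded management subsystems to communicate; it is an industry standard for managing the peripheral devices used in enterprise systems based on Intel architecture.
The SaltStack management tool lets administrators build a consistent management system across multiple operating systems. Its three main functions are remote execution, configuration management, and cloud management. SaltStack operates on a master/minion topology: combined with specific commands, it can execute on one or more minions. Besides running remote commands, SaltStack lets administrators use "grains". Grains can run remote queries on SaltStack minions, collecting the minions' state information and allowing administrators to store it in a central location. SaltStack can also help administrators define the desired state of a target system. These states are applied using .sls files, which contain very specific requirements on how to reach the desired state on the system.
For the underlying hardware devices covered by the IPMI specification, the indicators may include status information for the power supply, temperature, voltage, fans, battery, processor, memory, hard disk, logs, and so on. Each indicator in turn contains several finer-grained indicators. For example, hard disk status information includes the RAID card's chip version, status, cache status data, and RAID level status; power status information includes voltage, power consumption, power supply running status, power loss, and the number of power supplies in place; temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, and room temperature.
The indicators fall into two classes: numerical data indicators and non-numerical data indicators. For a numerical data indicator, a corresponding threshold must be set in advance, and whether an abnormality has occurred is judged by comparison with that threshold; for a non-numerical data indicator, an alarm is raised directly when a fault (failure to work normally) is detected. For example, an alarm is raised directly when a power loss is detected, whereas temperature and voltage are numerical data with alarm thresholds, which are set according to the standards and requirements of the computer room and the actual situation.
Through the SaltStack management tool, settings can be applied to multiple hardware devices uniformly and in batches. In this embodiment, a corresponding topology, such as a tree structure, can be configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. The threshold set for each indicator is also stored at the corresponding position in the topology, that is, at the node of the hardware device and indicator to which the threshold belongs.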
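The tree-shaped threshold store described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the application's implementation: the `Node` class, the helper functions, the host names, and the threshold values are all hypothetical, and a plain Python loop stands in for the batch configuration that the text assigns to SaltStack.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class Node:
    """A node of the monitoring tree: a device, an indicator, or a sub-indicator."""
    name: str
    low: Optional[float] = None   # lower alarm threshold (None = not set)
    high: Optional[float] = None  # upper alarm threshold (None = not set)
    children: Dict[str, "Node"] = field(default_factory=dict)

    def child(self, name: str) -> "Node":
        """Return the named child node, creating it on first use."""
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]


def set_threshold(root: Node, device: str, path: Sequence[str],
                  low: Optional[float] = None,
                  high: Optional[float] = None) -> None:
    """Store a threshold at the node device -> indicator -> sub-indicator."""
    node = root.child(device)
    for part in path:
        node = node.child(part)
    node.low, node.high = low, high


def get_threshold(root: Node, device: str, path: Sequence[str]
                  ) -> Optional[Tuple[Optional[float], Optional[float]]]:
    """Look up (low, high) for an indicator, or None if the node is missing."""
    node = root.children.get(device)
    for part in path:
        if node is None:
            return None
        node = node.children.get(part)
    return None if node is None else (node.low, node.high)


# Batch configuration: the same thresholds applied to several hypothetical hosts.
root = Node("computer-room")
for host in ("server-01", "server-02"):
    set_threshold(root, host, ("temperature", "cpu"), high=85.0)
    set_threshold(root, host, ("voltage", "12v"), low=11.4, high=12.6)
```

Keeping the threshold on the same node as the indicator means that a later traversal of the tree finds both the thing to measure and the limit to compare it against in one place.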
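The collection module 202 is configured to collect the indicator data of the hardware devices in a preset manner.
In this embodiment, the indicator data of the hardware devices is collected through the IPMI interface. IPMI information is exchanged through the baseboard management controller (BMC) located on hardware components that conform to the IPMI specification. Managing through low-level hardware intelligence rather than through the operating system has two main advantages: first, this configuration allows out-of-band server management; second, the operating system is relieved of the task of transmitting system status data. Users can use the IPMI interface to monitor the physical health of hardware devices, such as temperature, voltage, fan working status, and power supply status. The standard applies to different server topologies and to Windows, Linux, Solaris, Mac, or mixed operating systems. In addition, because IPMI can operate under different attribute values, it keeps working even if the server itself is malfunctioning or cannot provide service for any reason. The IPMI protocol interface can therefore be used to collect and monitor the indicator data of the hardware devices.
In this embodiment, a corresponding tree structure is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. The collection module 202 traverses each node of the tree and sends the corresponding IPMI command to each hardware device to collect the data for the corresponding indicator.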
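As a rough sketch of the traversal step, the snippet below walks a small monitoring tree and builds one `ipmitool` command line per sensor. The tree layout, the BMC addresses, the user name, and the sensor names are assumptions made for illustration; the text does not say which IPMI client is used, and `ipmitool` is chosen here only as a common one. The commands are built but not executed.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical monitoring tree: BMC address -> indicator group -> sensor names.
TREE: Dict[str, Dict[str, List[str]]] = {
    "10.0.0.11": {"temperature": ["CPU Temp", "System Temp"], "fan": ["FAN1"]},
    "10.0.0.12": {"temperature": ["CPU Temp"]},
}


def walk(tree: Dict[str, Dict[str, List[str]]]) -> Iterator[Tuple[str, str, str]]:
    """Visit every leaf of the tree as (host, indicator group, sensor)."""
    for host, groups in tree.items():
        for group, sensors in groups.items():
            for sensor in sensors:
                yield host, group, sensor


def ipmi_command(host: str, sensor: str, user: str = "monitor") -> List[str]:
    """Build an out-of-band ipmitool call that reads one sensor.

    -E tells ipmitool to take the password from the IPMI_PASSWORD
    environment variable, so it never appears on the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "sensor", "get", sensor]


def plan(tree: Dict[str, Dict[str, List[str]]]) -> List[List[str]]:
    """Return the full list of commands needed to poll the whole tree."""
    return [ipmi_command(host, sensor) for host, _group, sensor in walk(tree)]
```

Each returned list could be handed to `subprocess.run(cmd, capture_output=True, text=True)`; separating planning from execution keeps the traversal testable without a BMC.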
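Remote devices can be managed remotely through SaltStack, and the indicator data can be collected remotely in combination with the IPMI interface.
The obtaining module 203 is configured to obtain the thresholds set for the indicators.
Specifically, once the indicator data has been collected, the thresholds set for the respective indicators are obtained. In this embodiment, a numerical data indicator has a preset alarm threshold, which must be obtained; a non-numerical data indicator needs no threshold, and an alarm is raised directly when a fault is detected.
The judgment module 204 is configured to compare the collected indicator data with the corresponding thresholds and judge whether an abnormality has occurred.
Specifically, the indicator data of the hardware devices collected through the IPMI interface is compared with the corresponding thresholds obtained. For a numerical data indicator, the indicator is judged abnormal when the collected value is above or below the corresponding threshold (which of the two applies is chosen according to actual needs), in line with the standards and requirements of the computer room. For a non-numerical data indicator, the indicator is judged abnormal directly when the collected data shows that a fault has been found.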
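The judgment rule for the two indicator classes can be written as one small function. This is a sketch under assumptions: the `FAULT` marker and the convention that a missing bound is `None` are hypothetical choices, not something the text specifies.

```python
from typing import Optional, Union

# Hypothetical marker reported by a non-numerical indicator that has failed.
FAULT = "fault"


def is_abnormal(value: Union[int, float, str],
                low: Optional[float] = None,
                high: Optional[float] = None) -> bool:
    """Judge one reading.

    Numerical readings are abnormal when they exceed the upper threshold or
    fall below the lower one; a bound that is None is simply not checked.
    Non-numerical readings are abnormal as soon as they report a fault.
    """
    if isinstance(value, (int, float)):
        too_high = high is not None and value > high
        too_low = low is not None and value < low
        return too_high or too_low
    return value == FAULT
```

For example, `is_abnormal(91.0, high=85.0)` flags an over-temperature reading, while `is_abnormal("fault")` flags a failed power supply without consulting any threshold.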
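The notification module 205 is configured to issue an early-warning notification in a preset manner when an indicator is abnormal.
Specifically, for an indicator judged abnormal, an early-warning notification can be issued in any of several preset ways. In this embodiment, the alarms for the indicators can be displayed on a page, giving a visual early warning.
The hardware device abnormality monitoring apparatus provided in this embodiment can combine the IPMI interface with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices: it uniformly sets the indicators to be monitored on the hardware devices and the threshold for each indicator, collects the indicator data through the IPMI interface, compares the collected data with the corresponding thresholds to judge whether an abnormality has occurred, and, when an indicator is abnormal, gives a visual warning on a page. The system allows the hardware devices to be monitored, together with their indicators and thresholds, to be customized for a specific business scenario; the indicator data collected through the IPMI interface is then analyzed and checked for alarms according to that configuration. For indicators judged abnormal, the monitoring display platform provides visual management, configuration, and query, which makes abnormalities easier to discover and handle and improves both the timeliness of finding them and the efficiency of handling them.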
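Refer to Figure 3, a program module diagram of the second embodiment of the hardware device abnormality monitoring apparatus 200 of this application. In this embodiment, in addition to the setting module 201, collection module 202, obtaining module 203, judgment module 204, and notification module 205 of the first embodiment, the apparatus 200 further includes a recording module 206 and a trace-back module 207.
The recording module 206 is configured to record the processing feedback information for the abnormality.
Specifically, after an abnormality warning appears on the monitoring alarm page, the relevant operation and maintenance personnel follow up and coordinate with the people concerned (for example, after-sales personnel) to investigate and handle the abnormality. Once handling is complete, the handling record can be fed back into the system, which records the corresponding processing feedback information. The processing feedback information may include the cause of the abnormality, the processing time, the processing procedure, the processing result, the handler, and so on.
The trace-back module 207 is configured to save the collected indicator data and the processing feedback information so that trace-back operations can be performed.
Specifically, all collected indicator data is stored in a database, so all of it can be queried. As for the handling process, some of it is carried out by e-mail, some is initiated through events, and some is by telephone; once the processing feedback information has been recorded, it too can be traced back. In this embodiment, storing the collected indicator data and the corresponding processing feedback information in the database provides the basis for subsequent trace-back services. When a user needs to trace back, every recorded item of indicator data and the corresponding processing feedback information can be queried for further processing such as statistical analysis and hardware optimization, which helps improve device monitoring.
The hardware device abnormality monitoring apparatus provided in this embodiment can combine the IPMI interface with SaltStack to manage the underlying IPMI hardware devices, and collect data from them, remotely and in batches. It also records the processing feedback information for the abnormality, saves the collected indicator data and processing feedback information, and provides a trace-back service, so that the data and feedback information can later be queried for statistical analysis, optimization, and the like. For indicators judged abnormal, the monitoring display platform provides visual management, configuration, and query, which makes abnormalities easier to discover, handle, and trace back and improves both the timeliness of finding them and the efficiency of handling them.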
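Refer to Figure 4, a program module diagram of the third embodiment of the hardware device abnormality monitoring apparatus 200 of this application. In this embodiment, in addition to the setting module 201, collection module 202, obtaining module 203, judgment module 204, notification module 205, recording module 206, and trace-back module 207 of the second embodiment, the apparatus 200 further includes a screening module 208 and a prompting module 209.
The screening module 208 is configured to select, when an indicator is abnormal, a preferred processing solution for the abnormality based on the processing feedback information in the historical records.
Specifically, each time an abnormality of an indicator is found and handled, the corresponding processing feedback information is recorded and stored in the database for query. Therefore, when an indicator is judged abnormal, every historical record corresponding to that abnormality can be queried from the database. For example, when the motherboard temperature is abnormally high, the processing feedback information from every earlier occurrence of excessive motherboard temperature can be queried. A preferred processing solution for the abnormality is then selected according to information such as the processing procedure and processing result in each historical record found. The preferred processing solution may be the solution in the historical records whose processing result was successful and whose processing time was shortest.
In other embodiments, the screening module 208 may instead look up the preferred processing solution for the abnormality directly in a preset mapping table (which may be provided by the equipment supplier) that relates abnormal problems, their causes, and preferred processing solutions. Alternatively, the screening module 208 may query, through the network or big data, the processing feedback information of other computer rooms for the same abnormality (not limited to local historical records) and select the preferred processing solution from it.
The prompting module 209 is configured to prompt the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
Specifically, once the screening module 208 has selected the preferred processing solution for the abnormality, the solution is presented to the user in a preset manner (for example, displayed on a page). The content of the preferred processing solution includes the cause of the abnormality and the corresponding handling method.
Suppose that, when an abnormality occurs, the user receives only a warning and has no clear and effective way to eliminate the fault. The user then generally has to contact after-sales customer service to solve the problem, and for some complex abnormalities the after-sales personnel may not be able to locate and solve the problem quickly either. This wastes both time and manpower and raises maintenance costs. In this embodiment, by contrast, the user learns the preferred processing solution from the prompt and can handle the abnormality directly, without contacting and waiting for after-sales personnel.
The hardware device abnormality monitoring apparatus provided in this embodiment can use the processing feedback information in the historical records to provide a reliable preferred processing solution for each abnormality found, which improves the efficiency and accuracy of repair, saves time and manpower, and reduces the maintenance cost of the hardware devices in the computer room.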
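In addition, this application proposes a hardware device abnormality monitoring method.
Refer to Figure 5, a schematic flowchart of the first embodiment of the hardware device abnormality monitoring method of this application. In this embodiment, depending on requirements, the order of the steps in the flowchart shown in Figure 5 may be changed and some steps may be omitted.
The method includes the following steps:
Step S400: set the indicators to be monitored on the hardware devices and the threshold corresponding to each indicator.
In this embodiment, IPMI is combined with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices. IPMI is an open-standard hardware management interface specification that defines specific ways for embedded management subsystems to communicate; it is an industry standard for managing the peripheral devices used in enterprise systems based on Intel architecture.
The SaltStack management tool lets administrators build a consistent management system across multiple operating systems. Its three main functions are remote execution, configuration management, and cloud management. SaltStack operates on a master/minion topology: combined with specific commands, it can execute on one or more minions. Besides running remote commands, SaltStack lets administrators use "grains". Grains can run remote queries on SaltStack minions, collecting the minions' state information and allowing administrators to store it in a central location. SaltStack can also help administrators define the desired state of a target system. These states are applied using .sls files, which contain very specific requirements on how to reach the desired state on the system.
For the underlying hardware devices covered by the IPMI specification, the indicators may include status information for the power supply, temperature, voltage, fans, battery, processor, memory, hard disk, logs, and so on. Each indicator in turn contains several finer-grained indicators. For example, hard disk status information includes the RAID card's chip version, status, cache status data, and RAID level status; power status information includes voltage, power consumption, power supply running status, power loss, and the number of power supplies in place; temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, and room temperature.
The indicators fall into two classes: numerical data indicators and non-numerical data indicators. For a numerical data indicator, a corresponding threshold must be set in advance, and whether an abnormality has occurred is judged by comparison with that threshold; for a non-numerical data indicator, an alarm is raised directly when a fault (failure to work normally) is detected. For example, an alarm is raised directly when a power loss is detected, whereas temperature and voltage are numerical data with alarm thresholds, which are set according to the standards and requirements of the computer room and the actual situation.
Through the SaltStack management tool, settings can be applied to multiple hardware devices uniformly and in batches. In this embodiment, a corresponding topology, such as a tree structure, can be configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. The threshold set for each indicator is also stored at the corresponding position in the topology, that is, at the node of the hardware device and indicator to which the threshold belongs.
Step S402: collect the indicator data of the hardware devices in a preset manner.
In this embodiment, the indicator data of the hardware devices is collected through the IPMI interface. IPMI information is exchanged through the BMC located on hardware components that conform to the IPMI specification. Managing through low-level hardware intelligence rather than through the operating system has two main advantages: first, this configuration allows out-of-band server management; second, the operating system is relieved of the task of transmitting system status data. Users can use the IPMI interface to monitor the physical health of hardware devices, such as temperature, voltage, fan working status, and power supply status. The standard applies to different server topologies and to Windows, Linux, Solaris, Mac, or mixed operating systems. In addition, because IPMI can operate under different attribute values, it keeps working even if the server itself is malfunctioning or cannot provide service for any reason. The IPMI protocol interface can therefore be used to collect and monitor the indicator data of the hardware devices.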
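To turn collected sensor output into indicator data, something like the parser below could be used. It assumes the pipe-separated table that `ipmitool sensor` typically prints (name, reading, unit, status, then threshold columns); the exact columns vary between BMCs and tool versions, and the sample text is made up for illustration, so treat both as assumptions rather than a specification.

```python
from typing import Dict, Tuple, Union

# Illustrative sample only; real output depends on the BMC and ipmitool version.
SAMPLE = """\
CPU Temp         | 45.000     | degrees C  | ok    | na     | na     | na     | 85.000 | 90.000 | 95.000
FAN1             | na         | RPM        | na    | na     | na     | na     | na     | na     | na
PS1 Status       | 0x1        | discrete   | 0x0100| na     | na     | na     | na     | na     | na
"""

Reading = Tuple[Union[float, str], str]


def parse_sensor_table(text: str) -> Dict[str, Reading]:
    """Map each sensor name to (value, unit).

    Numerical readings become floats; anything else (for example "na" or a
    discrete state code) is kept as a string so that it can be treated as a
    non-numerical indicator later on.
    """
    readings: Dict[str, Reading] = {}
    for line in text.splitlines():
        fields = [part.strip() for part in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        name, raw, unit = fields[0], fields[1], fields[2]
        try:
            value: Union[float, str] = float(raw)
        except ValueError:
            value = raw
        readings[name] = (value, unit)
    return readings
```

Keeping unparsable readings as strings preserves the distinction the text draws between numerical and non-numerical indicators.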
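In this embodiment, a corresponding tree structure is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. Each node of the tree is traversed, and the corresponding IPMI command is sent to each hardware device to collect the data for the corresponding indicator.
Remote devices can be managed remotely through SaltStack, and the indicator data can be collected remotely in combination with the IPMI interface.
Step S404: obtain the thresholds set for the indicators.
Specifically, once the indicator data has been collected, the thresholds set for the respective indicators are obtained. In this embodiment, a numerical data indicator has a preset alarm threshold, which must be obtained; a non-numerical data indicator needs no threshold, and an alarm is raised directly when a fault is detected.
Step S406: compare the collected indicator data with the corresponding thresholds and judge whether an abnormality has occurred.
Specifically, the indicator data of the hardware devices collected through the IPMI interface is compared with the corresponding thresholds obtained. For a numerical data indicator, the indicator is judged abnormal when the collected value is above or below the corresponding threshold (which of the two applies is chosen according to actual needs), in line with the standards and requirements of the computer room. For a non-numerical data indicator, the indicator is judged abnormal directly when the collected data shows that a fault has been found.
Step S408: when an indicator is abnormal, issue an early-warning notification in a preset manner.
Specifically, for an indicator judged abnormal, an early-warning notification can be issued in any of several preset ways. In this embodiment, the alarms for the indicators can be displayed on a page, giving a visual early warning.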
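One simple way to picture the page-based warning is a function that renders the abnormal indicators as an HTML table. The alert fields (`device`, `indicator`, `value`) and the page layout are hypothetical; the text only says that alarms are shown on a page.

```python
from html import escape
from typing import Dict, Iterable


def render_alert_page(alerts: Iterable[Dict[str, object]]) -> str:
    """Render abnormal indicators as a minimal HTML table.

    Every cell is escaped so that sensor names or values containing
    characters such as "<" cannot break the page markup.
    """
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(str(alert["device"])),
            escape(str(alert["indicator"])),
            escape(str(alert["value"])),
        )
        for alert in alerts
    )
    header = "<tr><th>Device</th><th>Indicator</th><th>Value</th></tr>"
    return "<table>\n" + header + "\n" + rows + "\n</table>"
```

A monitoring display platform would normally add styling, severity levels, and links to the handling workflow; this sketch covers only the step from a list of abnormal indicators to something a browser can show.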
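The hardware device abnormality monitoring method provided in this embodiment can combine the IPMI interface with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices: it uniformly sets the indicators to be monitored on the hardware devices and the threshold for each indicator, collects the indicator data through the IPMI interface, compares the collected data with the corresponding thresholds to judge whether an abnormality has occurred, and, when an indicator is abnormal, gives a visual warning on a page. The method allows the hardware devices to be monitored, together with their indicators and thresholds, to be customized for a specific business scenario; the indicator data collected through the IPMI interface is then analyzed and checked for alarms according to that configuration. For indicators judged abnormal, the monitoring display platform provides visual management, configuration, and query, which makes abnormalities easier to discover and handle and improves both the timeliness of finding them and the efficiency of handling them.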
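Figure 6 is a schematic flowchart of the second embodiment of the hardware device abnormality monitoring method of this application. In this embodiment, steps S500-S508 of the method are similar to steps S400-S408 of the first embodiment; the difference is that the method further includes steps S510-S512.
The method includes the following steps:
Step S500: set the indicators to be monitored on the hardware devices and the threshold corresponding to each indicator.
In this embodiment, IPMI is combined with the SaltStack management tool to implement unified batch management and abnormality monitoring of hardware devices. IPMI is an open-standard hardware management interface specification that defines specific ways for embedded management subsystems to communicate; it is an industry standard for managing the peripheral devices used in enterprise systems based on Intel architecture.
The SaltStack management tool lets administrators build a consistent management system across multiple operating systems. Its three main functions are remote execution, configuration management, and cloud management. SaltStack operates on a master/minion topology: combined with specific commands, it can execute on one or more minions. Besides running remote commands, SaltStack lets administrators use "grains". Grains can run remote queries on SaltStack minions, collecting the minions' state information and allowing administrators to store it in a central location. SaltStack can also help administrators define the desired state of a target system. These states are applied using .sls files, which contain very specific requirements on how to reach the desired state on the system.
For the underlying hardware devices covered by the IPMI specification, the indicators may include status information for the power supply, temperature, voltage, fans, battery, processor, memory, hard disk, logs, and so on. Each indicator in turn contains several finer-grained indicators. For example, hard disk status information includes the RAID card's chip version, status, cache status data, and RAID level status; power status information includes voltage, power consumption, power supply running status, power loss, and the number of power supplies in place; temperature status information includes CPU temperature, motherboard temperature, fan temperature, hard disk temperature, and room temperature.
The indicators fall into two classes: numerical data indicators and non-numerical data indicators. For a numerical data indicator, a corresponding threshold must be set in advance, and whether an abnormality has occurred is judged by comparison with that threshold; for a non-numerical data indicator, an alarm is raised directly when a fault (failure to work normally) is detected. For example, an alarm is raised directly when a power loss is detected, whereas temperature and voltage are numerical data with alarm thresholds, which are set according to the standards and requirements of the computer room and the actual situation.
Through the SaltStack management tool, settings can be applied to multiple hardware devices uniformly and in batches. In this embodiment, a corresponding topology, such as a tree structure, can be configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. The threshold set for each indicator is also stored at the corresponding position in the topology, that is, at the node of the hardware device and indicator to which the threshold belongs.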
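Step S502: collect the indicator data of the hardware devices in a preset manner.
In this embodiment, the indicator data of the hardware devices is collected through the IPMI interface. IPMI information is exchanged through the BMC located on hardware components that conform to the IPMI specification. Managing through low-level hardware intelligence rather than through the operating system has two main advantages: first, this configuration allows out-of-band server management; second, the operating system is relieved of the task of transmitting system status data. Users can use the IPMI interface to monitor the physical health of hardware devices, such as temperature, voltage, fan working status, and power supply status. The standard applies to different server topologies and to Windows, Linux, Solaris, Mac, or mixed operating systems. In addition, because IPMI can operate under different attribute values, it keeps working even if the server itself is malfunctioning or cannot provide service for any reason. The IPMI protocol interface can therefore be used to collect and monitor the indicator data of the hardware devices.
In this embodiment, a corresponding tree structure is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each device. Each node of the tree is traversed, and the corresponding IPMI command is sent to each hardware device to collect the data for the corresponding indicator.
Remote devices can be managed remotely through SaltStack, and the indicator data can be collected remotely in combination with the IPMI interface.
Step S504: obtain the thresholds set for the indicators.
Specifically, once the indicator data has been collected, the thresholds set for the respective indicators are obtained. In this embodiment, a numerical data indicator has a preset alarm threshold, which must be obtained; a non-numerical data indicator needs no threshold, and an alarm is raised directly when a fault is detected.
Step S506: compare the collected indicator data with the corresponding thresholds and judge whether an abnormality has occurred.
Specifically, the indicator data of the hardware devices collected through the IPMI interface is compared with the corresponding thresholds obtained. For a numerical data indicator, the indicator is judged abnormal when the collected value is above or below the corresponding threshold (which of the two applies is chosen according to actual needs), in line with the standards and requirements of the computer room. For a non-numerical data indicator, the indicator is judged abnormal directly when the collected data shows that a fault has been found.
Step S508: when an indicator is abnormal, issue an early-warning notification in a preset manner.
Specifically, for an indicator judged abnormal, an early-warning notification can be issued in any of several preset ways. In this embodiment, the alarms for the indicators can be displayed on a page, giving a visual early warning.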
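Step S510: record the processing feedback information for the abnormality.
Specifically, after an abnormality warning appears on the monitoring alarm page, the relevant operation and maintenance personnel follow up and coordinate with the people concerned (for example, after-sales personnel) to investigate and handle the abnormality. Once handling is complete, the handling record can be fed back into the system, which records the corresponding processing feedback information. The processing feedback information may include the cause of the abnormality, the processing time, the processing procedure, the processing result, the handler, and so on.
Step S512: save the collected indicator data and the processing feedback information so that trace-back operations can be performed.
Specifically, all collected indicator data is stored in a database, so all of it can be queried. As for the handling process, some of it is carried out by e-mail, some is initiated through events, and some is by telephone; once the processing feedback information has been recorded, it too can be traced back. In this embodiment, storing the collected indicator data and the corresponding processing feedback information in the database provides the basis for subsequent trace-back services. When a user needs to trace back, every recorded item of indicator data and the corresponding processing feedback information can be queried for further processing such as statistical analysis and hardware optimization, which helps improve device monitoring.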
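Steps S510 and S512 can be illustrated with Python's built-in `sqlite3` module. The table names, column names, and the choice of SQLite are assumptions made for this sketch; the text only states that indicator data and processing feedback are stored in a database and can be queried later. The `minutes` column reads "processing time" as a duration, which is also an assumption.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the two hypothetical tables if needed."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS readings (
            ts TEXT, device TEXT, indicator TEXT, value TEXT);
        CREATE TABLE IF NOT EXISTS feedback (
            ts TEXT, device TEXT, indicator TEXT, cause TEXT,
            process TEXT, result TEXT, minutes REAL, handler TEXT);
    """)
    return db


def save_reading(db, ts, device, indicator, value):
    """Store one collected indicator value (kept as text to allow any type)."""
    db.execute("INSERT INTO readings VALUES (?, ?, ?, ?)",
               (ts, device, indicator, str(value)))
    db.commit()


def save_feedback(db, ts, device, indicator, cause, process, result,
                  minutes, handler):
    """Store the feedback recorded after an abnormality has been handled."""
    db.execute("INSERT INTO feedback VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
               (ts, device, indicator, cause, process, result, minutes, handler))
    db.commit()


def trace_back(db, device, indicator):
    """Return (readings, feedback) for one indicator, oldest first."""
    readings = db.execute(
        "SELECT ts, value FROM readings "
        "WHERE device = ? AND indicator = ? ORDER BY ts",
        (device, indicator)).fetchall()
    feedback = db.execute(
        "SELECT ts, cause, process, result, minutes, handler FROM feedback "
        "WHERE device = ? AND indicator = ? ORDER BY ts",
        (device, indicator)).fetchall()
    return readings, feedback
```

Storing readings and feedback under the same `device` and `indicator` keys is what lets a later query line up what was measured with how the abnormality was handled.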
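The hardware device abnormality monitoring method provided in this embodiment can combine the IPMI interface with SaltStack to manage the underlying IPMI hardware devices, and collect data from them, remotely and in batches. It also records the processing feedback information for the abnormality, saves the collected indicator data and processing feedback information, and provides a trace-back service, so that the data and feedback information can later be queried for statistical analysis, optimization, and the like. For indicators judged abnormal, the monitoring display platform provides visual management, configuration, and query, which makes abnormalities easier to discover, handle, and trace back and improves both the timeliness of finding them and the efficiency of handling them.
Figure 7 is a schematic flowchart of the third embodiment of the hardware device abnormality monitoring method of this application. In this embodiment, steps S600-S612 of the method are similar to steps S500-S512 of the second embodiment (not repeated here); the difference is that the method further includes steps S614-S616. Specifically:
Step S614: based on the processing feedback information in the historical records, select a preferred processing solution for the abnormality.
Specifically, each time an abnormality of an indicator is found and handled, the corresponding processing feedback information is recorded and stored in the database for query. Therefore, when an indicator is judged abnormal, every historical record corresponding to that abnormality can be queried from the database. For example, when the motherboard temperature is abnormally high, the processing feedback information from every earlier occurrence of excessive motherboard temperature can be queried. A preferred processing solution for the abnormality is then selected according to information such as the processing procedure and processing result in each historical record found. The preferred processing solution may be the solution in the historical records whose processing result was successful and whose processing time was shortest.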
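The selection rule stated above (successful result, shortest processing time) can be sketched as follows. The `Feedback` record layout and the sample history are hypothetical; only the selection criterion comes from the text.

```python
from typing import Iterable, NamedTuple, Optional


class Feedback(NamedTuple):
    """One historical handling record (field names are illustrative)."""
    anomaly: str     # e.g. "board_temp_high"
    cause: str       # cause recorded by the handler
    process: str     # what was done
    success: bool    # whether the handling succeeded
    minutes: float   # how long the handling took


def preferred_solution(history: Iterable[Feedback],
                       anomaly: str) -> Optional[Feedback]:
    """Pick the successful record for this anomaly with the shortest duration."""
    candidates = [h for h in history if h.anomaly == anomaly and h.success]
    return min(candidates, key=lambda h: h.minutes, default=None)


HISTORY = [
    Feedback("board_temp_high", "dust", "cleaned chassis", True, 60.0),
    Feedback("board_temp_high", "fan failure", "replaced fan", True, 25.0),
    Feedback("board_temp_high", "unknown", "rebooted", False, 5.0),
]
```

With this history, `preferred_solution(HISTORY, "board_temp_high")` returns the "replaced fan" record: the five-minute reboot is faster but is excluded because it did not succeed.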
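In other embodiments, the preferred processing solution for the abnormality may instead be looked up directly in a preset mapping table (which may be provided by the equipment supplier) that relates abnormal problems, their causes, and preferred processing solutions. Alternatively, the processing feedback information of other computer rooms for the same abnormality (not limited to local historical records) may be queried through the network or big data, and the preferred processing solution selected from it.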
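The alternative sources mentioned here can be pictured as a chain of lookups tried in order. Everything in this sketch is hypothetical: the table contents are invented, and in-memory dictionaries stand in for a vendor-supplied mapping table and for records fetched from other computer rooms.

```python
from typing import Callable, Iterable, Optional, Tuple

Solution = Tuple[str, str]  # (cause, handling method)

# Stand-in for a mapping table supplied by the equipment vendor.
VENDOR_TABLE = {
    "board_temp_high": ("blocked airflow", "clean the air filter and check the fans"),
}

# Stand-in for feedback records obtained from other computer rooms.
OTHER_ROOM_RECORDS = {
    "psu_power_loss": ("loose power cable", "reseat the cable and check the PDU"),
}


def first_solution(anomaly: str,
                   sources: Iterable[Callable[[str], Optional[Solution]]]
                   ) -> Optional[Solution]:
    """Ask each source in turn and return the first solution found."""
    for lookup in sources:
        hit = lookup(anomaly)
        if hit is not None:
            return hit
    return None


SOURCES = (VENDOR_TABLE.get, OTHER_ROOM_RECORDS.get)
```

The order of `SOURCES` encodes a preference (vendor table first, then remote records); the text does not prescribe any particular order, so that ordering is a design choice of this sketch.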
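Step S616: prompt the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
Specifically, once the preferred processing solution for the abnormality has been selected, it is presented to the user in a preset manner (for example, displayed on a page). The content of the preferred processing solution includes the cause of the abnormality and the corresponding handling method.
Suppose that, when an abnormality occurs, the user receives only a warning and has no clear and effective way to eliminate the fault. The user then generally has to contact after-sales customer service to solve the problem, and for some complex abnormalities the after-sales personnel may not be able to locate and solve the problem quickly either. This wastes both time and manpower and raises maintenance costs. In this embodiment, by contrast, the user learns the preferred processing solution from the prompt and can handle the abnormality directly, without contacting and waiting for after-sales personnel.
The hardware device abnormality monitoring method provided in this embodiment can use the processing feedback information in the historical records to provide a reliable preferred processing solution for each abnormality found, which improves the efficiency and accuracy of repair, saves time and manpower, and reduces the maintenance cost of the hardware devices in the computer room.
This application also provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile and which stores a hardware device abnormality monitoring program that can be executed by at least one processor, so that the at least one processor performs the steps of the hardware device abnormality monitoring method described above.
The sequence numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. On this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied as a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.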

Claims (20)

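  1. A hardware device abnormality monitoring method, wherein the method comprises the steps of:
    uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
    collecting, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner;
    obtaining the thresholds set for the indicators;
    comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
    when an indicator is abnormal, issuing an early-warning notification in a preset manner.
  2. The hardware device abnormality monitoring method according to claim 1, wherein the method further comprises the steps of:
    recording processing feedback information for the abnormality;
    saving the collected indicator data and the processing feedback information so that trace-back operations can be performed.
  3. The hardware device abnormality monitoring method according to claim 2, wherein the method further comprises the steps of:
    selecting a preferred processing solution for the abnormality based on the processing feedback information in the historical records;
    prompting the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
  4. The hardware device abnormality monitoring method according to any one of claims 1-3, wherein, in the step of uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds:
    a topology is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each hardware device, and the threshold set for each indicator is stored at the node of the topology corresponding to the hardware device and indicator.
  5. The hardware device abnormality monitoring method according to claim 4, wherein, in the step of collecting, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner:
    each node of the topology is traversed, and the corresponding Intelligent Platform Management Interface command is sent to each hardware device to collect the data for the corresponding indicator.
  6. The hardware device abnormality monitoring method according to any one of claims 1-3, wherein, in the step of obtaining the thresholds set for the indicators:
    if the indicator is a numerical data indicator, the alarm threshold corresponding to the indicator is obtained from the unified settings of the SaltStack management tool; if the indicator is a non-numerical data indicator, the step of issuing an early-warning notification is triggered when a fault is detected.
  7. The hardware device abnormality monitoring method according to any one of claims 1-3, wherein, in the step of issuing an early-warning notification in a preset manner:
    the alarms for the indicators are displayed on a page, giving a visual early warning.
  8. The hardware device abnormality monitoring method according to claim 3, wherein the preferred processing solution is the solution, among the processing feedback information in the historical records corresponding to the abnormality, whose processing result was successful and whose processing time was shortest.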
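  9. A hardware device abnormality monitoring apparatus, wherein the apparatus comprises:
    a setting module, configured to uniformly set, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
    a collection module, configured to collect, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner;
    an obtaining module, configured to obtain the thresholds set for the indicators;
    a judgment module, configured to compare the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
    a notification module, configured to issue an early-warning notification in a preset manner when an indicator is abnormal.
  10. A server, wherein the server comprises a memory and a processor, the memory storing a hardware device abnormality monitoring program that can run on the processor, the program implementing the following steps when executed by the processor:
    uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
    collecting, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner;
    obtaining the thresholds set for the indicators;
    comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
    when an indicator is abnormal, issuing an early-warning notification in a preset manner.
  11. The server according to claim 10, wherein the hardware device abnormality monitoring program, when executed by the processor, further implements the steps of:
    recording processing feedback information for the abnormality;
    saving the collected indicator data and the processing feedback information so that trace-back operations can be performed.
  12. The server according to claim 11, wherein the hardware device abnormality monitoring program, when executed by the processor, further implements the steps of:
    selecting a preferred processing solution for the abnormality based on the processing feedback information in the historical records;
    prompting the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
  13. The server according to any one of claims 10-12, wherein, in the step of uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds:
    a topology is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each hardware device, and the threshold set for each indicator is stored at the node of the topology corresponding to the hardware device and indicator.
  14. The server according to claim 13, wherein, in the step of collecting, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner:
    each node of the topology is traversed, and the corresponding Intelligent Platform Management Interface command is sent to each hardware device to collect the data for the corresponding indicator.
  15. The server according to any one of claims 10-12, wherein, in the step of obtaining the thresholds set for the indicators:
    if the indicator is a numerical data indicator, the alarm threshold corresponding to the indicator is obtained from the unified settings of the SaltStack management tool; if the indicator is a non-numerical data indicator, the step of issuing an early-warning notification is triggered when a fault is detected.
  16. The server according to any one of claims 10-12, wherein, in the step of issuing an early-warning notification in a preset manner:
    the alarms for the indicators are displayed on a page, giving a visual early warning.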
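  17. A computer-readable storage medium, wherein the computer-readable storage medium stores a hardware device abnormality monitoring program that can be executed by at least one processor, so that the at least one processor performs the following steps:
    uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds;
    collecting, through the Intelligent Platform Management Interface, the indicator data of the hardware devices in a preset manner;
    obtaining the thresholds set for the indicators;
    comparing the collected indicator data with the corresponding thresholds to judge whether an abnormality has occurred; and
    when an indicator is abnormal, issuing an early-warning notification in a preset manner.
  18. The computer-readable storage medium according to claim 17, wherein the hardware device abnormality monitoring program, when executed by the processor, further implements the steps of:
    recording processing feedback information for the abnormality;
    saving the collected indicator data and the processing feedback information so that trace-back operations can be performed.
  19. The computer-readable storage medium according to claim 18, wherein the hardware device abnormality monitoring program, when executed by the processor, further implements the steps of:
    selecting a preferred processing solution for the abnormality based on the processing feedback information in the historical records;
    prompting the user with the preferred processing solution so that the user can refer to it when handling the abnormality.
  20. The computer-readable storage medium according to any one of claims 17-19, wherein, in the step of uniformly setting, through the SaltStack management tool, the indicators of the hardware devices to be monitored and the corresponding thresholds:
    a topology is configured for the multiple hardware devices to be monitored and the indicators to be monitored on each hardware device, and the threshold set for each indicator is stored at the node of the topology corresponding to the hardware device and indicator.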
PCT/CN2020/119081 2019-10-11 2020-09-29 硬件设备异常监控方法、装置、服务器及计算机可读存储介质 WO2021068814A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910967009.2A CN110851322A (zh) 2019-10-11 2019-10-11 硬件设备异常监控方法、服务器及计算机可读存储介质
CN201910967009.2 2019-10-11

Publications (1)

Publication Number Publication Date
WO2021068814A1 true WO2021068814A1 (zh) 2021-04-15

Family

ID=69597412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119081 WO2021068814A1 (zh) 2019-10-11 2020-09-29 硬件设备异常监控方法、装置、服务器及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN110851322A (zh)
WO (1) WO2021068814A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851322A (zh) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 硬件设备异常监控方法、服务器及计算机可读存储介质
CN111679956A (zh) * 2020-05-07 2020-09-18 上海正网信息技术有限公司 一种带外管理系统及管理方法
CN113965447B (zh) * 2020-07-20 2023-07-21 广东芬尼克兹节能设备有限公司 一种在线云诊断方法、装置、系统、设备及存储介质
CN112416712A (zh) * 2020-11-20 2021-02-26 常州微亿智造科技有限公司 基于工业云边服务数据采集的监控方法和装置
CN112506754A (zh) * 2020-12-13 2021-03-16 国网河北省电力有限公司雄安新区供电公司 一种系统性能监测方法及平台
CN114627627A (zh) * 2020-12-14 2022-06-14 深圳Tcl新技术有限公司 设备异常处理方法、装置、终端及计算机可读存储介质
CN112561385A (zh) * 2020-12-24 2021-03-26 平安银行股份有限公司 风险监控方法及系统
CN112631887A (zh) * 2020-12-25 2021-04-09 百度在线网络技术(北京)有限公司 异常检测方法、装置、电子设备和计算机可读存储介质
CN113535407B (zh) * 2021-07-30 2024-03-19 济南浪潮数据技术有限公司 一种服务器的优化方法、系统、设备及存储介质
CN113815636B (zh) * 2021-09-28 2023-06-23 国汽(北京)智能网联汽车研究院有限公司 一种车辆安全监控方法、装置、电子设备及存储介质
CN114240155A (zh) * 2021-12-17 2022-03-25 中国工商银行股份有限公司 机房设备健康度的评估方法、装置、计算机设备
CN117992315A (zh) * 2024-04-03 2024-05-07 福建时代星云科技有限公司 一种ems平台数据可视化方法及终端

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797915A (zh) * 2016-09-07 2018-03-13 北京国双科技有限公司 故障的修复方法、装置及系统
CN107222346A (zh) * 2017-06-09 2017-09-29 郑州云海信息技术有限公司 一种集群节点健康状态预警方法及系统
CN109165024A (zh) * 2018-07-26 2019-01-08 天讯瑞达通信技术有限公司 一种运维平台自动部署和监控服务器系统的方法
CN110851322A (zh) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 硬件设备异常监控方法、服务器及计算机可读存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215498A (zh) * 2023-11-07 2023-12-12 江苏荣泽信息科技股份有限公司 基于硬件存储监管的企业数据存储智能管理系统
CN117215498B (zh) * 2023-11-07 2024-01-30 江苏荣泽信息科技股份有限公司 基于硬件存储监管的企业数据存储智能管理系统

Also Published As

Publication number Publication date
CN110851322A (zh) 2020-02-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874881

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20874881

Country of ref document: EP

Kind code of ref document: A1