WO2019214010A1 - Method and apparatus for monitoring device failure - Google Patents

Method and apparatus for monitoring device failure

Info

Publication number
WO2019214010A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
data collection
fault type
script
monitoring
Prior art date
Application number
PCT/CN2018/091208
Other languages
English (en)
French (fr)
Inventor
陈涛
林烽
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to US16/463,488 (US20210109800A1)
Priority to EP18901807.0A (EP3591485B1)
Publication of WO2019214010A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00 Testing or monitoring of control systems or parts thereof
    • G05B23/02 Electric testing or monitoring
    • G05B23/0205 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G05B23/0208 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterized by the configuration of the monitoring system
    • G05B23/0213 Modular or universal configuration of the monitoring system, e.g. monitoring system having modules that may be combined to build monitoring program; monitoring system that can be applied to legacy systems; adaptable monitoring system; using different communication protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0775 Content or structure details of the error report, e.g. specific table structure, specific error fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/20 Pc systems
    • G05B2219/24 Pc safety
    • G05B2219/24065 Real time diagnostics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold

Definitions

  • The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for monitoring device failure.
  • Most existing monitoring tools are system programs that come with the device, such as "mpstat" for the CPU, "iostat" for I/O, and "top" for processes.
  • With these tools, the performance indicators of the device can be detected. Once the device fails, the corresponding performance indicators become abnormal.
  • The user can view the performance indicators detected by the above monitoring tools and analyze them together with the related operating parameters to form a general understanding of the fault, or even to accurately determine its cause, location, and time. Further, the user can also devise a targeted solution to the fault based on the above performance indicators.
  • The types and number of available monitoring tools are very large, and the functional overlap between some monitoring tools is high. Therefore, for a given operational failure of the device, the user often detects the same or different performance indicators through a large number of monitoring tools, which not only costs the user a lot of time and effort but also consumes a lot of device processing resources for performance monitoring.
  • Embodiments of the present invention provide a method and apparatus for monitoring device failure.
  • The technical solution is as follows:
  • A method of monitoring a device failure, comprising:
  • The device operating parameters are collected by using a plurality of data collection tools, included in the tool collection script, that correspond to the target preset key indicator;
  • The plurality of preset key indicators include at least one of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process.
  • Collecting the device operating parameters by using the plurality of data collection tools, included in the tool collection script, that correspond to the target preset key indicator includes:
  • For each target preset key indicator, data collection threads are configured for the plurality of data collection tools, included in the tool collection script, that correspond to that indicator
  • Executing all the data collection threads to collect the device operating parameters includes:
  • All the data collection threads are divided into synchronous collection threads and asynchronous collection threads;
  • All synchronous collection threads are executed concurrently by multiple threads, and the collected device operating parameters are stored in a multi-threaded storage queue protected by a read-write lock;
  • The asynchronous collection threads are executed sequentially.
  • Determining and feeding back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong includes:
  • If the device operating parameters match the states of the plurality of preset key indicators, determining, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
  • Determining and feeding back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong includes:
  • The method further includes:
  • The script running configuration includes one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback modes corresponding to the fault types.
  • An apparatus for monitoring device failure, comprising:
  • a monitoring module configured to load and run a tool collection script integrated with a plurality of monitoring tools, and periodically monitor a plurality of preset key indicators by using a plurality of preset basic tools included in the tool collection script;
  • a collecting module configured to collect, when a target preset key indicator is abnormal, device operating parameters by using a plurality of data collection tools, included in the tool collection script, that correspond to the target preset key indicator;
  • a determining module configured to determine and feed back the fault type to which the device operating parameters belong according to the parameter features corresponding to the preset fault types.
  • The plurality of preset key indicators include at least one of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process.
  • The collecting module is specifically configured to:
  • For each target preset key indicator, configure data collection threads for the plurality of data collection tools, included in the tool collection script, that correspond to that indicator
  • The collecting module is specifically configured to:
  • Divide all the data collection threads into synchronous collection threads and asynchronous collection threads;
  • Execute all synchronous collection threads concurrently by multiple threads, and store the collected device operating parameters in a multi-threaded storage queue protected by a read-write lock;
  • Execute the asynchronous collection threads sequentially.
  • The determining module is specifically configured to:
  • If the device operating parameters match the states of the plurality of preset key indicators, determine, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
  • The determining module is specifically configured to:
  • The apparatus further includes:
  • a receiving module configured to receive a configuration adjustment instruction input by the user for the tool collection script
  • an update module configured to update a script running configuration of the tool collection script according to the configuration adjustment instruction, where the script running configuration includes one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback modes corresponding to the fault types.
  • An apparatus comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method of monitoring device failure described in the first aspect.
  • A fourth aspect provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method of monitoring device failure described in the first aspect.
  • A tool collection script integrated with multiple monitoring tools is loaded and run, and a plurality of preset key indicators are periodically monitored through a plurality of preset basic tools included in the tool collection script; when a target preset key indicator becomes abnormal, device operating parameters are collected by using the data collection tools, included in the tool collection script, that correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types.
  • In this way, the running state of the device is monitored automatically and uniformly.
  • The fault type can be fed back quickly and accurately based on the execution logic of the tool collection script, without excessive user involvement, and fewer device processing resources are consumed.
  • FIG. 1 is a flowchart of a method for monitoring a device fault according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of trigger data collection according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of performing data collection according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an apparatus for monitoring a device fault according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an apparatus for monitoring a device fault according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method for monitoring a device fault.
  • The method may be executed by any device capable of running programs, such as a server or a terminal.
  • The device may include a processor, a memory, and a transceiver. The processor may be configured to perform the device failure monitoring described in the following process. The memory may be used to store data required and generated during processing, such as storing the tool collection script and recording device operating parameters. The transceiver may be used to receive and send relevant data during processing, such as receiving instructions input by the user and feeding back the monitoring results of device failures.
  • The device can support multiple processes running at the same time. A running process occupies processing resources of the device CPU, uses a certain amount of memory, and generates disk I/O.
  • Step 101: Load and run a tool collection script integrated with multiple monitoring tools, and periodically monitor a plurality of preset key indicators through a plurality of preset basic tools included in the tool collection script.
  • A tool collection script integrated with a plurality of monitoring tools can be developed. The script monitors the running state of the device from different angles by using different monitoring tools, so that hardware or software faults occurring on the device can be discovered in time.
  • The device can load and run the tool collection script, and periodically monitor multiple preset key indicators through multiple preset basic tools included in the tool collection script.
  • A plurality of key indicators may be preset so that whether a fault has occurred on the device can be found simply and promptly. For each preset key indicator, a small number of preset basic tools capable of reflecting that indicator are set, and these basic tools monitor the indicator in real time. Because only a small number of basic tools are executed to monitor the key indicators, few device processing resources are consumed and the impact on device performance is small.
  • The foregoing plurality of preset key indicators include at least one of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process. It can be understood that in other embodiments, the preset key indicators are not limited to those enumerated above.
  • For example, five indicators (CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process) can be selected as the preset key indicators.
  • Detection may be performed once per cycle, with a detection duration of 1 second.
  • Detection may be performed once per cycle.
  • Detection is performed once.
  • For the I/O wait duration, the "mpstat" tool may be used for detection.
  • Detection may be performed once per cycle, with a detection duration of 1 second.
  • Detection may be performed once per cycle, with a detection duration of 1 second.
  • Detection may be performed once per cycle, with a detection duration of 1 second.
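
As a minimal sketch of the periodic monitoring in step 101, the loop below samples each preset key indicator once per cycle. The names `monitor_cycles` and `stub_tools`, and the fixed stub values, are assumptions made for illustration; in a real script each callable would wrap a basic tool such as "mpstat" or "top".

```python
import time
from typing import Callable, Dict, List

# Hypothetical registry: indicator name -> callable returning its current value.
# Each callable stands in for a preset basic tool; none of these names come
# from the patent text.
BasicTool = Callable[[], float]


def monitor_cycles(tools: Dict[str, BasicTool], cycles: int,
                   period_s: float = 0.0) -> List[Dict[str, float]]:
    """Sample every preset key indicator once per cycle and return the readings."""
    history = []
    for _ in range(cycles):
        history.append({name: tool() for name, tool in tools.items()})
        time.sleep(period_s)  # wait for the next monitoring cycle
    return history


# Stub tools with fixed values stand in for real measurements.
stub_tools = {
    "cpu_usage": lambda: 35.0,
    "memory_usage": lambda: 60.0,
    "load": lambda: 1.2,
    "io_wait": lambda: 0.5,
}
readings = monitor_cycles(stub_tools, cycles=2)
```

Keeping the basic tools behind plain callables means the loop itself stays cheap, which matches the stated goal of monitoring key indicators with few processing resources.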
  • Step 102: When a target preset key indicator is abnormal, collect device operating parameters by using a plurality of data collection tools, included in the tool collection script, that correspond to the target preset key indicator.
  • While monitoring the preset key indicators through the preset basic tools in the tool collection script, the device may use threshold determination, based on empirical data from routine analysis, to check whether a monitored preset key indicator is abnormal, and thereby judge whether the subsequent data collection processing needs to be triggered. The specific processing is shown in FIG. 2. If a preset key indicator (the target preset key indicator) becomes abnormal, the device may first determine the plurality of data collection tools in the tool collection script that correspond to the target preset key indicator, and then collect device operating parameters through those data collection tools.
  • For each preset key indicator, the data collection tools to be used when that indicator is abnormal are preset, and the device operating parameters related to the abnormality can be collected with those tools.
  • On the one hand, monitoring the preset key indicators and judging the monitoring results takes little time. If a preset key indicator is abnormal, the interval between discovering the abnormality and collecting the device operating parameters is short, so the abnormality is unlikely to disappear or change in the meantime. On the other hand, if the preset key indicators are normal, no processing is performed, which avoids a large amount of useless data collection.
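
The threshold determination that triggers data collection can be sketched as follows. The threshold values and the names `THRESHOLDS` and `abnormal_indicators` are hypothetical; the source only says that thresholds are derived from empirical data.

```python
from typing import Dict, List

# Hypothetical upper limits for each preset key indicator (assumed values).
THRESHOLDS = {"cpu_usage": 90.0, "memory_usage": 90.0, "load": 8.0, "io_wait": 20.0}


def abnormal_indicators(sample: Dict[str, float],
                        thresholds: Dict[str, float]) -> List[str]:
    """Return the indicators whose sampled value exceeds its threshold."""
    return [name for name, value in sample.items()
            if name in thresholds and value > thresholds[name]]


sample = {"cpu_usage": 97.0, "memory_usage": 40.0, "load": 9.5, "io_wait": 1.0}
targets = abnormal_indicators(sample, THRESHOLDS)

# Data collection is triggered only when at least one indicator is abnormal;
# otherwise nothing further is done for this cycle.
should_collect = bool(targets)
```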
  • The device operating parameters may be collected by configuring data collection threads. Correspondingly, the processing of step 102 may be as follows: when at least one target preset key indicator is abnormal, for each target preset key indicator, configure data collection threads for the plurality of data collection tools in the tool collection script that correspond to it; eliminate duplicate data collection threads among all the data collection threads; configure a daemon thread for all the data collection threads; and execute all the data collection threads to collect the device operating parameters.
  • When detecting that at least one target preset key indicator is abnormal, the device may first determine, for each target preset key indicator, the plurality of data collection tools in the tool collection script that correspond to it, and then configure data collection threads for these tools. The device can then eliminate duplicates from all the configured data collection threads. Further, the device can configure a daemon thread for all the data collection threads to ensure that execution ends only after all the required device operating parameters have been collected. The device can then execute all the data collection threads to collect the device operating parameters.
  • The above process is shown in FIG. 3.
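
The thread configuration and de-duplication described above might look like the sketch below. The indicator-to-tool mapping, the `collect` callback, and the function names are all assumptions; joining every worker thread is used here as a simple stand-in for the guarding (daemon) thread that waits until all parameters are collected.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical mapping from key indicator to the data collection tools it triggers.
TOOLS_BY_INDICATOR: Dict[str, List[str]] = {
    "cpu_usage": ["top", "mpstat", "pidstat"],
    "load": ["top", "vmstat"],
}


def plan_collection(targets: List[str],
                    tools_by_indicator: Dict[str, List[str]]) -> List[str]:
    """List the tools for all abnormal indicators, dropping duplicates in order."""
    planned: List[str] = []
    for indicator in targets:
        for tool in tools_by_indicator.get(indicator, []):
            if tool not in planned:
                planned.append(tool)
    return planned


def run_collection(planned: List[str],
                   collect: Callable[[str], str]) -> Dict[str, str]:
    """Run one thread per planned tool and return once all of them have finished."""
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(tool: str) -> None:
        output = collect(tool)
        with lock:  # protect the shared result dictionary
            results[tool] = output

    threads = [threading.Thread(target=worker, args=(tool,)) for tool in planned]
    for thread in threads:
        thread.start()
    for thread in threads:  # wait until every collector has finished
        thread.join()
    return results


planned = plan_collection(["cpu_usage", "load"], TOOLS_BY_INDICATOR)
results = run_collection(planned, collect=lambda tool: f"output of {tool}")
```

Here "top" appears for both indicators but is planned only once, which is the effect of eliminating duplicate data collection threads.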
  • The data collection threads may be divided into synchronous collection threads and asynchronous collection threads, and the corresponding processing may be as follows: according to the synchronization requirement of each data collection tool, divide all the data collection threads into synchronous collection threads and asynchronous collection threads; execute all synchronous collection threads concurrently by multiple threads, and store the collected device operating parameters in a multi-threaded storage queue protected by a read-write lock; after the synchronous collection threads finish, execute the asynchronous collection threads sequentially.
  • The device may first divide all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool, then execute all the synchronous collection threads concurrently using multiple threads, and save the collection results in a multi-threaded storage queue protected by a read-write lock, which prevents the collected device operating parameters from becoming mixed up. After the synchronous collection threads finish, the device can execute the asynchronous collection threads one after another.
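
A minimal sketch of this two-phase execution is given below. Python's standard library has no read-write lock, so the thread-safe `queue.Queue` is swapped in for the read-write-locked storage queue; that substitution, the task tuple layout, and the name `collect_all` are assumptions for illustration.

```python
import queue
import threading
from typing import Callable, List, Tuple

# A task is (name, needs_sync, fn); this layout is hypothetical.
Task = Tuple[str, bool, Callable[[], object]]


def collect_all(tasks: List[Task]) -> List[Tuple[str, object]]:
    """Run synchronous tasks concurrently, then asynchronous tasks one by one."""
    store: "queue.Queue[Tuple[str, object]]" = queue.Queue()  # thread-safe store
    sync_tasks = [task for task in tasks if task[1]]
    async_tasks = [task for task in tasks if not task[1]]

    # Phase 1: all synchronous collection threads execute at the same time.
    threads = [
        threading.Thread(target=lambda n=name, f=fn: store.put((n, f())))
        for name, _, fn in sync_tasks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Phase 2: the remaining collectors run sequentially after phase 1 ends.
    for name, _, fn in async_tasks:
        store.put((name, fn()))

    collected = []
    while not store.empty():
        collected.append(store.get())
    return collected


tasks = [
    ("cpu_snapshot", True, lambda: 1),
    ("mem_snapshot", True, lambda: 2),
    ("disk_scan", False, lambda: 3),
]
collected = collect_all(tasks)
```

The order of the two concurrent results is not guaranteed, but the sequential result always comes last because phase 2 starts only after every phase 1 thread has been joined.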
  • Step 103: Determine and feed back the fault type to which the device operating parameters belong according to the parameter features corresponding to the preset fault types.
  • Technicians can anticipate the various faults that may occur on the device, record the parameter features of the device operating parameters when each fault occurs, and then write the parameter features and fault types into the source code of the tool collection script.
  • The device can read the parameter features and fault types described above.
  • The device can determine the fault type to which the device operating parameters belong according to the parameter features corresponding to the preset fault types.
  • The device may then feed back the fault type to the user of the device.
  • The feedback mode may be to display the fault type directly on the screen of the device, to write the fault type into the running log of the device, or to send it by mail to the user's default mailbox.
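
As an illustration of selectable feedback modes, the sketch below covers the screen and log modes only; sending mail is deliberately left out because it would depend on a mail server. The function name `feed_back`, the mode strings, and the logger name are hypothetical.

```python
import logging


def feed_back(fault_type: str, mode: str,
              logger: logging.Logger = logging.getLogger("fault_monitor")) -> str:
    """Report the fault type on screen or in the log and return the message."""
    message = f"Detected fault type: {fault_type}"
    if mode == "screen":
        print(message)            # display directly on the device screen
    elif mode == "log":
        logger.warning(message)   # write into the running log
    else:
        # Other modes (for example mail) are not covered by this sketch.
        raise ValueError(f"unsupported feedback mode: {mode}")
    return message


message = feed_back("io_bottleneck", "screen")
```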
  • The device operating parameters may first be validated. Correspondingly, the processing in step 103 may be as follows: if the device operating parameters match the states of the plurality of preset key indicators, determine and feed back the fault type to which the device operating parameters belong according to the parameter features corresponding to the preset fault types.
  • The device may re-verify whether the device operating parameters are consistent with the states of the plurality of preset key indicators detected in step 102, that is, whether the device operating parameters show that the target preset key indicator is abnormal and that the other preset key indicators are normal. If the states do not match, the device operating parameters collected this time may be discarded, and the next trigger of step 102 is awaited. If the states match, the fault type to which the device operating parameters belong may be determined and fed back according to the parameter features corresponding to the preset fault types.
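
This validity check can be sketched as a comparison between the collected parameters and the indicator states seen during monitoring. The function `states_match`, the threshold values, and the assumption that the collected parameters carry one value per key indicator are all illustrative.

```python
from typing import Dict, Set


def states_match(params: Dict[str, float], thresholds: Dict[str, float],
                 targets: Set[str]) -> bool:
    """True when params confirm the monitored state: targets abnormal, others normal."""
    for name, limit in thresholds.items():
        if name not in params:
            return False  # the state cannot be confirmed without a value
        is_abnormal = params[name] > limit
        if is_abnormal != (name in targets):
            return False
    return True


thresholds = {"cpu_usage": 90.0, "io_wait": 20.0}  # assumed limits
confirmed = states_match({"cpu_usage": 97.0, "io_wait": 3.0}, thresholds, {"cpu_usage"})
stale = states_match({"cpu_usage": 50.0, "io_wait": 3.0}, thresholds, {"cpu_usage"})
```

When the result is false, as in the `stale` case where the CPU abnormality has already vanished, the collected parameters would be discarded and the next trigger awaited.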
  • The device operating parameters may be compared with all the fault types one by one. Correspondingly, the processing of step 103 may be as follows: determine, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features; collate the device operating parameters of those parameter types, and judge whether the collated device operating parameters satisfy the parameter features; if so, determine and feed back the current fault type; otherwise, verify the next fault type.
  • The fault type library can be maintained on the device.
  • The fault type library can summarize all the faults that may occur on the device and the parameter features of the device operating parameters when each fault occurs. After the device operating parameters are collected, the parameter types and corresponding parameter features required by each fault type in the fault type library may be determined one by one, and the collected device operating parameters are then collated to gather those of the required parameter types.
  • It can then be determined whether the collated device operating parameters satisfy the parameter features corresponding to the current fault type. If they do, the current fault type can be determined and fed back. Otherwise, the next fault type is verified, that is, the steps of determining the parameter types and parameter features, collating the device operating parameters, and judging whether the parameter features are satisfied are performed again.
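
The one-by-one comparison against a fault type library might be sketched as below. The library entries, fault type names, parameter names, and numeric conditions are invented for illustration and are not taken from the patent.

```python
from typing import Callable, Dict, List, Optional, TypedDict


class FaultEntry(TypedDict):
    fault_type: str
    required: List[str]                            # parameter types this fault needs
    matches: Callable[[Dict[str, float]], bool]    # parameter feature check


# Hypothetical fault type library.
FAULT_LIBRARY: List[FaultEntry] = [
    {"fault_type": "runaway_process",
     "required": ["cpu_usage", "top_process_cpu"],
     "matches": lambda p: p["cpu_usage"] > 90 and p["top_process_cpu"] > 80},
    {"fault_type": "io_bottleneck",
     "required": ["io_wait"],
     "matches": lambda p: p["io_wait"] > 20},
]


def classify(params: Dict[str, float], library: List[FaultEntry]) -> Optional[str]:
    """Check each fault type in turn and return the first whose features are met."""
    for entry in library:
        if not all(key in params for key in entry["required"]):
            continue  # required parameter types were not collected
        subset = {key: params[key] for key in entry["required"]}  # collate
        if entry["matches"](subset):
            return entry["fault_type"]
    return None  # no fault type in the library matched


fault = classify({"cpu_usage": 95, "top_process_cpu": 88, "io_wait": 2}, FAULT_LIBRARY)
```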
  • The user may configure the tool collection script according to their own needs, and the corresponding processing may be as follows: receive a configuration adjustment instruction input by the user for the tool collection script; update the script running configuration of the tool collection script according to the configuration adjustment instruction.
  • The script running configuration includes one or more of the following items: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback modes corresponding to the fault types.
  • When the device loads and runs the tool collection script, it runs the script by default using the default values in the script.
  • The default values can be preset by the developer of the tool collection script and suit most device fault monitoring scenarios.
  • The user can adjust configuration items to change the script running configuration, such as the types of monitoring tools in the tool collection script and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback modes corresponding to the fault types.
  • Specifically, after the user performs the corresponding configuration adjustment operation, the device can receive the configuration adjustment instruction input by the user for the tool collection script, and then update the script running configuration of the tool collection script according to the configuration adjustment instruction.
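
A minimal sketch of applying a configuration adjustment on top of developer defaults follows. The configuration keys, default values, and the function `apply_adjustment` are assumptions; the source does not specify a configuration format.

```python
import copy
from typing import Any, Dict

# Hypothetical default script running configuration set by the script developer.
DEFAULT_CONFIG: Dict[str, Any] = {
    "period_s": 60,
    "key_indicators": ["cpu_usage", "memory_usage", "load", "io_wait"],
    "feedback": "log",
}


def apply_adjustment(config: Dict[str, Any],
                     adjustment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new configuration with the user's adjustment applied."""
    unknown = set(adjustment) - set(config)
    if unknown:
        raise KeyError(f"unknown configuration items: {sorted(unknown)}")
    updated = copy.deepcopy(config)  # leave the defaults untouched
    updated.update(adjustment)
    return updated


updated = apply_adjustment(DEFAULT_CONFIG, {"feedback": "mail", "period_s": 30})
```

Returning a copy keeps the developer defaults intact, so a later reset to the default configuration remains possible.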
  • A tool collection script integrated with multiple monitoring tools is loaded and run, and a plurality of preset key indicators are periodically monitored through a plurality of preset basic tools included in the tool collection script; when a target preset key indicator becomes abnormal, device operating parameters are collected by using the data collection tools, included in the tool collection script, that correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types.
  • In this way, the running state of the device is monitored automatically and uniformly.
  • The fault type can be fed back quickly and accurately based on the execution logic of the tool collection script, without excessive user involvement, and fewer device processing resources are consumed.
  • An embodiment of the present invention further provides an apparatus for monitoring device failure. As shown in FIG. 4, the apparatus includes:
  • a monitoring module 401 configured to load and run a tool collection script integrated with a plurality of monitoring tools, and periodically monitor a plurality of preset key indicators by using a plurality of preset basic tools included in the tool collection script;
  • a collecting module 402 configured to collect, when a target preset key indicator is abnormal, device operating parameters by using a plurality of data collection tools, included in the tool collection script, that correspond to the target preset key indicator;
  • a determining module 403 configured to determine and feed back the fault type to which the device operating parameters belong according to the parameter features corresponding to the preset fault types.
  • The plurality of preset key indicators include at least one of CPU usage, memory usage, load value, I/O wait duration, and the CPU usage of each process.
  • The collecting module 402 is specifically configured to:
  • For each target preset key indicator, configure data collection threads for the plurality of data collection tools, included in the tool collection script, that correspond to that indicator
  • The collecting module 402 is specifically configured to:
  • Divide all the data collection threads into synchronous collection threads and asynchronous collection threads;
  • Execute all synchronous collection threads concurrently by multiple threads, and store the collected device operating parameters in a multi-threaded storage queue protected by a read-write lock;
  • Execute the asynchronous collection threads sequentially.
  • The determining module 403 is specifically configured to:
  • If the device operating parameters match the states of the plurality of preset key indicators, determine, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
  • The determining module 403 is specifically configured to:
  • The apparatus further includes:
  • a receiving module 404 configured to receive a configuration adjustment instruction input by the user for the tool collection script
  • an update module 405 configured to update a script running configuration of the tool collection script according to the configuration adjustment instruction, where the script running configuration includes one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback modes corresponding to the fault types.
  • A tool collection script integrated with multiple monitoring tools is loaded and run, and a plurality of preset key indicators are periodically monitored through a plurality of preset basic tools included in the tool collection script; when a target preset key indicator becomes abnormal, device operating parameters are collected by using the data collection tools, included in the tool collection script, that correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types.
  • In this way, the running state of the device is monitored automatically and uniformly.
  • The fault type can be fed back quickly and accurately based on the execution logic of the tool collection script, without excessive user involvement, and fewer device processing resources are consumed.
  • The apparatus for monitoring device failure provided by the foregoing embodiment is illustrated only by the above division into functional modules. In actual applications, the functions may be assigned to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above.
  • The apparatus for monitoring device failure provided by the foregoing embodiment belongs to the same concept as the method embodiment for monitoring device failure. The specific implementation process is described in detail in the method embodiment and is not repeated here.
  • FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present invention.
  • The device 600 can vary considerably depending on configuration or performance, and can include one or more central processors 622 (e.g., one or more processors), memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 662 or data 666.
  • The memory 632 and the storage medium 630 may be transient storage or persistent storage.
  • The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the device.
  • Further, the central processor 622 can be configured to communicate with the storage medium 630 and to execute, on the device 600, the series of instruction operations in the storage medium 630.
  • The device 600 may also include one or more power sources 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, one or more keyboards 656, and/or one or more operating systems 661, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • The device 600 can include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the above monitoring of device failure.
  • A person of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the related hardware, and the program may be stored in a computer-readable storage medium.
  • The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Automation & Control Theory (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

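A method and apparatus for monitoring device failure, belonging to the field of computer technology. The method includes: loading and running a tool set script that integrates multiple monitoring tools, and periodically monitoring multiple preset key indicators through multiple preset basic tools included in the tool set script (101); when a target preset key indicator becomes abnormal, collecting device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator (102); and determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong (103). The method and apparatus can save the time and effort a user spends on monitoring device failures, as well as the device processing resources used for performance monitoring.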

Description

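Method and apparatus for monitoring device failure
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for monitoring device failure.
Background
While a device is running, hardware or software problems often cause operating faults, which may reduce the device's processing capability, produce errors in execution logic, or even lead to device downtime or component damage. To discover and resolve operating faults as early as possible, a user can usually check the device's performance indicators through performance monitoring programs (referred to as monitoring tools) to understand the operating state of the device.
Most existing monitoring tools are system programs shipped with the device, such as "mpstat" for the CPU, "iostat" for I/O, and "top" for processes. These monitoring tools can detect the device's performance indicators; once an operating fault occurs, the corresponding performance indicators become abnormal. The user can therefore view the performance indicators detected by these tools and analyze them together with related operating parameters, forming a rough understanding of the fault or even determining precisely its cause, location, and time. Further, the user can also propose a targeted solution to the fault based on these performance indicators.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
The types and number of currently available monitoring tools are very large, and the functions of some monitoring tools overlap heavily. As a result, for a given operating fault, a user often uses a large number of monitoring tools to obtain the same or different performance indicators, which not only wastes much of the user's time and effort but also consumes a large amount of device processing resources for performance monitoring.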
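Summary
To solve the problems of the prior art, embodiments of the present invention provide a method and apparatus for monitoring device failure. The technical solutions are as follows:
In a first aspect, a method for monitoring device failure is provided, the method including:
loading and running a tool set script that integrates multiple monitoring tools, and periodically monitoring multiple preset key indicators through multiple preset basic tools included in the tool set script;
when a target preset key indicator becomes abnormal, collecting device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator;
determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
Optionally, the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
Optionally, the collecting, when a target preset key indicator becomes abnormal, device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator includes:
when at least one target preset key indicator becomes abnormal, configuring, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script;
removing duplicate data collection threads from all the data collection threads;
configuring a guard thread for all the data collection threads;
executing all the data collection threads to collect the device operating parameters.
Optionally, the executing all the data collection threads to collect the device operating parameters includes:
dividing all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool;
executing all the synchronous collection threads simultaneously with multiple threads, and storing the collected device operating parameters in a multi-thread storage queue with a read-write lock;
after the synchronous collection threads finish, executing the asynchronous collection threads sequentially.
Optionally, the determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong includes:
if the device operating parameters are consistent with the states of the multiple preset key indicators, determining and feeding back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
Optionally, the determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong includes:
determining, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features;
sorting out the device operating parameters under the parameter types, and judging whether the sorted device operating parameters conform to the parameter features;
if they conform, determining and feeding back the current fault type; otherwise, verifying the next fault type.
Optionally, the method further includes:
receiving a configuration adjustment instruction input by the user for the tool set script;
updating the script running configuration of the tool set script according to the configuration adjustment instruction, where the script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.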
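In a second aspect, an apparatus for monitoring device failure is provided, the apparatus including:
a monitoring module, configured to load and run a tool set script that integrates multiple monitoring tools, and to periodically monitor multiple preset key indicators through multiple preset basic tools included in the tool set script;
a collection module, configured to collect, when a target preset key indicator becomes abnormal, device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator;
a determining module, configured to determine and feed back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
Optionally, the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
Optionally, the collection module is specifically configured to:
when at least one target preset key indicator becomes abnormal, configure, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script;
remove duplicate data collection threads from all the data collection threads;
configure a guard thread for all the data collection threads;
execute all the data collection threads to collect the device operating parameters.
Optionally, the collection module is specifically configured to:
divide all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool;
execute all the synchronous collection threads simultaneously with multiple threads, and store the collected device operating parameters in a multi-thread storage queue with a read-write lock;
after the synchronous collection threads finish, execute the asynchronous collection threads sequentially.
Optionally, the determining module is specifically configured to:
if the device operating parameters are consistent with the states of the multiple preset key indicators, determine and feed back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
Optionally, the determining module is specifically configured to:
determine, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features;
sort out the device operating parameters under the parameter types, and judge whether the sorted device operating parameters conform to the parameter features;
if they conform, determine and feed back the current fault type; otherwise, verify the next fault type.
Optionally, the apparatus further includes:
a receiving module, configured to receive a configuration adjustment instruction input by the user for the tool set script;
an update module, configured to update the script running configuration of the tool set script according to the configuration adjustment instruction, where the script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.
In a third aspect, a device is provided, the device including a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for monitoring device failure according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for monitoring device failure according to the first aspect.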
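The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
In the embodiments of the present invention, a tool set script that integrates multiple monitoring tools is loaded and run, and multiple preset key indicators are periodically monitored through multiple preset basic tools included in the tool set script; when a target preset key indicator becomes abnormal, device operating parameters are collected through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types. In this way, the operating state of the device is monitored automatically and uniformly through the monitoring tools in the tool set script. When the device fails, the fault type can be fed back relatively quickly and accurately based on the execution logic of the tool set script, without much user involvement and with low consumption of device processing resources.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for monitoring device failure according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of triggering data collection according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of executing data collection according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for monitoring device failure according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for monitoring device failure according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present invention.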
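Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
An embodiment of the present invention provides a method for monitoring device failure. The method may be executed by any device capable of running programs, such as a server or a terminal. The device may include a processor, a memory, and a transceiver. The processor may be used to perform the failure-monitoring processing in the flow described below; the memory may be used to store data required and generated during processing, for example storing the tool set script and recording device operating parameters; the transceiver may be used to receive and send data related to the processing, for example receiving instructions input by the user and feeding back the monitoring results for device failures. The device may support multiple processes running simultaneously; while running, each process occupies the device's CPU processing resources to a varying degree, uses a certain amount of memory, and generates disk I/O.
The processing flow shown in FIG. 1 is described in detail below with reference to specific embodiments, and may be as follows:
Step 101: load and run a tool set script that integrates multiple monitoring tools, and periodically monitor multiple preset key indicators through multiple preset basic tools included in the tool set script.
In implementation, a tool set script integrating multiple monitoring tools may be developed. Through the tool set script, different monitoring tools can be used to monitor the operating state of the device from different angles, so that hardware or software faults arising during operation can be discovered in time. Specifically, after the tool set script is installed on the device, the device may load and run it and periodically monitor multiple preset key indicators through the multiple preset basic tools included in the script. Here, the preset key indicators may be set in advance; they allow the presence of a fault on the device to be discovered simply and promptly. Each preset key indicator can be monitored in real time by a small number of preset basic tools capable of showing whether that indicator is abnormal. Running only a few basic tools to monitor the key indicators consumes few device processing resources and has little impact on device performance.
Optionally, the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage. It can be understood that, in other embodiments, the preset key indicators are not limited to those listed above.
In implementation, CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage may be selected as the preset key indicators. Correspondingly, CPU usage may be checked with the "mpstat" tool, once per period with a sampling duration of 1 second; memory usage may be checked by reading the "used" and "free" fields of "free -m", once per period; the load value may be checked by reading the 1-minute load field of the "/proc/loadavg" file, once per period; I/O wait time may be checked with the "mpstat" tool, once per period with a sampling duration of 1 second; and per-process CPU usage may be checked with the "top" tool, once per period with a sampling duration of 1 second.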
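As an illustration of how two of these key indicators might be read, the following Python sketch parses the contents of "/proc/loadavg" and the output of "free -m". It is a minimal sketch, not the script described in this application: the function names are hypothetical, and it assumes the common procps layout in which the "Mem:" row of "free -m" lines up with the header columns.

```python
def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    # The first whitespace-separated field is the 1-minute load average.
    return float(text.split()[0])


def memory_usage_from_free(output: str) -> float:
    """Return used/total memory (between 0 and 1) from the output of `free -m`."""
    lines = output.strip().splitlines()
    header = lines[0].split()                      # e.g. total, used, free, ...
    mem_row = next(line for line in lines[1:] if line.startswith("Mem:"))
    values = [int(v) for v in mem_row.split()[1:]]  # drop the "Mem:" label
    fields = dict(zip(header, values))
    return fields["used"] / fields["total"]


# Sample inputs (made-up values, for illustration only).
SAMPLE_LOADAVG = "0.52 0.41 0.37 2/613 12345\n"
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:           7821        3910        1200         100        2711        3500
Swap:          2047           0        2047
"""

load_1min = parse_loadavg(SAMPLE_LOADAVG)
mem_usage = memory_usage_from_free(SAMPLE_FREE)
```

On a real Linux device the inputs would come from reading the file and running the command once per monitoring period; the sample strings above only stand in for that output.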
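Step 102: when a target preset key indicator becomes abnormal, collect device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator.
In implementation, while the device monitors the preset key indicators through the preset basic tools in the tool set script, it may use threshold judgment, checking against empirical data used in routine analysis, to decide whether a monitored preset key indicator is abnormal and therefore whether the subsequent data collection needs to be triggered; the specific processing is shown in FIG. 2. If a preset key indicator (the target preset key indicator) is found to be abnormal, the device may first determine the multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator, and then collect device operating parameters through those tools. It can be understood that the data collection tools to be used when each preset key indicator is abnormal are set in advance, and that these tools can collect the device operating parameters related to the abnormality of that indicator. On the one hand, monitoring the preset key indicators and judging the monitoring results both take little time, so if a preset key indicator is abnormal, the interval between discovering the abnormality and collecting the device operating parameters is short, and the abnormality is unlikely to have disappeared or changed. On the other hand, if the preset key indicators are normal, no processing is needed, which avoids collecting and processing large amounts of useless data.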
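The threshold judgment described above can be sketched in Python as follows. The indicator names and threshold values are hypothetical placeholders; the application only states that thresholds are derived from empirical data and does not give concrete numbers.

```python
# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {
    "cpu_usage": 90.0,          # percent
    "mem_usage": 90.0,          # percent
    "load_1min": 8.0,           # 1-minute load average
    "io_wait": 30.0,            # percent of CPU time waiting on I/O
    "process_cpu_usage": 80.0,  # percent, highest single process
}


def find_abnormal_indicators(readings, thresholds=THRESHOLDS):
    """Return the names of monitored indicators whose reading exceeds its threshold."""
    return [
        name
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    ]


readings = {"cpu_usage": 97.5, "mem_usage": 40.0, "load_1min": 9.1}
abnormal = find_abnormal_indicators(readings)  # the target preset key indicators
```

Data collection would be triggered only when the returned list is non-empty; an empty list corresponds to the case where no further processing is performed.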
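Optionally, the device operating parameters may be collected by configuring data collection threads. Correspondingly, the processing of step 102 may be as follows: when at least one target preset key indicator becomes abnormal, configure, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script; remove duplicate data collection threads from all the data collection threads; configure a guard thread for all the data collection threads; and execute all the data collection threads to collect the device operating parameters.
In implementation, when at least one target preset key indicator is detected to be abnormal, the device may first determine, for each target preset key indicator, the multiple data collection tools that are included in the tool set script and correspond to that indicator, and then configure data collection threads for these tools. The device may then remove duplicate data collection threads from all the configured data collection threads. Further, the device may configure a guard thread for all the data collection threads, to ensure that subsequent processing starts only after every data collection thread has finished and all required device operating parameters have been collected. The device may then execute all the data collection threads to collect the device operating parameters. This execution process is shown in FIG. 3.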
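The thread configuration, de-duplication, and guard steps can be sketched in Python as below. This is one possible reading of the description, not the application's implementation: the indicator-to-tool mapping is hypothetical, and the collect callable is a stand-in for actually invoking a tool.

```python
import threading

# Hypothetical mapping from key indicators to their data collection tools.
COLLECTORS_BY_INDICATOR = {
    "cpu_usage": ["mpstat", "top"],
    "io_wait": ["iostat", "mpstat"],
}


def plan_collectors(abnormal_indicators):
    """List the tools to run, dropping duplicates while keeping first-seen order."""
    tools = []
    for indicator in abnormal_indicators:
        for tool in COLLECTORS_BY_INDICATOR.get(indicator, []):
            if tool not in tools:
                tools.append(tool)
    return tools


def run_collectors(tools, collect):
    """Run one thread per tool; a guard thread signals when all have finished."""
    results = {}
    lock = threading.Lock()

    def worker(tool):
        value = collect(tool)
        with lock:  # protect the shared result dictionary
            results[tool] = value

    threads = [threading.Thread(target=worker, args=(tool,)) for tool in tools]
    all_done = threading.Event()

    def guard():
        for thread in threads:
            thread.join()
        all_done.set()  # only now may subsequent processing start

    for thread in threads:
        thread.start()
    guard_thread = threading.Thread(target=guard)
    guard_thread.start()
    all_done.wait()
    guard_thread.join()
    return results


tools = plan_collectors(["cpu_usage", "io_wait"])
collected = run_collectors(tools, lambda tool: f"output of {tool}")
```

Here "mpstat" is requested by both abnormal indicators but runs only once, which is the point of removing duplicate collection threads.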
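Optionally, to reduce the pressure that data collection places on the device's CPU and memory while keeping the collected device operating parameters consistent, the data collection threads may be divided into synchronous and asynchronous threads. The corresponding processing may be as follows: divide all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool; execute all the synchronous collection threads simultaneously with multiple threads, and store the collected device operating parameters in a multi-thread storage queue with a read-write lock; after the synchronous collection threads finish, execute the asynchronous collection threads sequentially.
In implementation, different data collection tools have different synchronization requirements for their start time. For example, tools such as "mpstat" and "top" have a high synchronization requirement, while tools such as "load" have a relatively low one. Therefore, when executing all the data collection threads to collect the device operating parameters, the device may first divide them into synchronous and asynchronous collection threads according to the synchronization requirement of each data collection tool, then execute all the synchronous collection threads simultaneously with multiple threads and use a multi-thread storage queue with a read-write lock to save the collection results, which prevents the collected device operating parameters from becoming mixed up. After the synchronous collection threads finish, the device may execute the asynchronous collection threads sequentially.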
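A minimal Python sketch of this two-phase execution is shown below. It makes two simplifying assumptions that are not part of the application: a threading.Barrier is used to release the synchronous collectors at the same moment, and the standard library's thread-safe queue.Queue stands in for the multi-thread storage queue with a read-write lock (Python's standard library has no read-write lock type). The collect callable is hypothetical.

```python
import queue
import threading


def collect_in_two_phases(sync_tools, async_tools, collect):
    """Run sync_tools concurrently from a common start, then async_tools one by one."""
    results = queue.Queue()  # thread-safe store for (tool, value) pairs

    if sync_tools:
        barrier = threading.Barrier(len(sync_tools))

        def worker(tool):
            barrier.wait()  # all synchronous collectors start together
            results.put((tool, collect(tool)))

        threads = [threading.Thread(target=worker, args=(t,)) for t in sync_tools]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()  # the synchronous phase must finish first

    # Asynchronous collectors have no start-time constraint; run them in order.
    for tool in async_tools:
        results.put((tool, collect(tool)))

    collected = {}
    while not results.empty():
        tool, value = results.get()
        collected[tool] = value
    return collected


two_phase = collect_in_two_phases(
    ["mpstat", "top"], ["load"], lambda tool: f"output of {tool}"
)
```

Running the low-priority collectors sequentially keeps the number of simultaneously active threads small, which matches the stated goal of limiting CPU and memory pressure during collection.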
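Step 103: determine and feed back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
In implementation, technicians may anticipate the various faults that may occur on the device and record the parameter features of the device operating parameters for each fault. The parameter features and the corresponding fault types may then be written into the source code of the tool set script, so that after loading and running the script the device can read this parameter feature and fault type data. After collecting the device operating parameters, the device can determine, according to the parameter features corresponding to the preset fault types, the fault type to which the parameters belong. The device may then feed the fault type back to its user. Specifically, the fault type may be displayed directly on the device's screen, written to the device's running log, or sent by email to a mailbox preset by the user.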
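Optionally, before the fault type of the device operating parameters is determined, the validity of the parameters may be verified. Correspondingly, the processing of step 103 may be as follows: if the device operating parameters are consistent with the states of the multiple preset key indicators, determine and feed back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
In implementation, after collecting the device operating parameters, the device may verify again whether they are consistent with the states of the multiple preset key indicators detected in step 102; that is, judge from the device operating parameters whether the target preset key indicator is abnormal and whether the other preset key indicators are normal. If the states are inconsistent, the device operating parameters collected this time may be discarded and the device waits for step 102 to be triggered again. If the states are consistent, the fault type to which the device operating parameters belong may be determined and fed back according to the parameter features corresponding to the preset fault types.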
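The validity check can be sketched in Python as follows, under the assumption that each key indicator can be recomputed from the collected parameters and compared with the same thresholds used during monitoring. The function name, indicator names, and threshold values are hypothetical.

```python
def parameters_match_indicator_state(recomputed, thresholds, abnormal_indicators):
    """Return True when indicators re-derived from the collected parameters
    agree with the state observed during monitoring: target indicators are
    still abnormal and all other indicators are still normal."""
    for name, limit in thresholds.items():
        if name not in recomputed:
            continue  # nothing collected for this indicator; skip it
        is_abnormal_now = recomputed[name] > limit
        was_abnormal = name in abnormal_indicators
        if is_abnormal_now != was_abnormal:
            return False
    return True


thresholds = {"cpu_usage": 90.0, "mem_usage": 90.0}  # hypothetical values
still_valid = parameters_match_indicator_state(
    {"cpu_usage": 95.0, "mem_usage": 40.0}, thresholds, {"cpu_usage"}
)
stale = parameters_match_indicator_state(
    {"cpu_usage": 50.0, "mem_usage": 40.0}, thresholds, {"cpu_usage"}
)
```

When the check fails, as in the second call where the CPU anomaly has already disappeared, the collected parameters would be discarded and the device would wait for the next trigger.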
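Optionally, when the fault type of the device is determined, the device operating parameters may be compared with all fault types one by one. Correspondingly, the processing of step 103 may be as follows: determine, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features; sort out the device operating parameters under the parameter types, and judge whether the sorted device operating parameters conform to the parameter features; if they conform, determine and feed back the current fault type; otherwise, verify the next fault type.
In implementation, after the tool set script is loaded, the device may maintain a fault type library that summarizes all faults that may occur on the device, together with the parameter features of the device operating parameters when each fault occurs. After the device operating parameters are collected, the parameter types required by each fault type in the library and the corresponding parameter features may be determined one by one, and the collected device operating parameters may be sorted to gather those under the corresponding parameter types. It may then be judged whether the sorted device operating parameters conform to the parameter features of the current fault type. If they do, the current fault type may be determined and fed back; otherwise the next fault type is verified, that is, the above processing of determining the parameter types and parameter features, sorting the device operating parameters, and judging conformity is repeated.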
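The one-by-one comparison against the fault type library can be sketched in Python as follows. The library entries, parameter names, and numeric conditions are invented for illustration; the application does not list concrete fault types or parameter features.

```python
# Hypothetical fault type library: each entry names the parameter types it
# needs and a predicate expressing its parameter features.
FAULT_LIBRARY = [
    {
        "fault_type": "single process saturating the CPU",
        "parameter_types": ["cpu_usage", "top_process_cpu"],
        "matches": lambda p: p["cpu_usage"] > 90 and p["top_process_cpu"] > 80,
    },
    {
        "fault_type": "disk I/O bottleneck",
        "parameter_types": ["io_wait", "disk_util"],
        "matches": lambda p: p["io_wait"] > 30 and p["disk_util"] > 90,
    },
]


def classify_fault(parameters, library=FAULT_LIBRARY):
    """Check fault types in order and return the first whose features match."""
    for entry in library:
        needed = entry["parameter_types"]
        if not all(name in parameters for name in needed):
            continue  # required parameters were not collected; try the next type
        subset = {name: parameters[name] for name in needed}  # sorted-out view
        if entry["matches"](subset):
            return entry["fault_type"]
    return None  # no fault type in the library matched


fault = classify_fault({"cpu_usage": 20.0, "io_wait": 45.0, "disk_util": 97.0})
```

Because entries are checked in order and the first match wins, the ordering of the library decides which fault type is reported when several would match; this is a design choice of the sketch.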
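Optionally, the user may set the specific configuration of the tool set script involved in the above process according to the user's own needs. The corresponding processing may be as follows: receive a configuration adjustment instruction input by the user for the tool set script; update the script running configuration of the tool set script according to the configuration adjustment instruction.
The script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.
In implementation, when the device loads and runs the tool set script, it runs the script by default with the default values in the script. These default values may be preset by the developers of the tool set script and are suitable for most device failure monitoring scenarios. The user can change the script running configuration by adjusting configuration items, including the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types. Specifically, after the user performs a configuration adjustment operation, the device receives the configuration adjustment instruction input by the user for the tool set script and may then update the script running configuration of the tool set script accordingly.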
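One way to picture the configuration update is as a merge of user adjustments over the script's default values, sketched below in Python. The configuration keys and default values are hypothetical; the application does not define a configuration format.

```python
import copy

# Hypothetical default script running configuration.
DEFAULT_CONFIG = {
    "monitor_period_seconds": 60,
    "key_indicators": {
        "cpu_usage": {"basic_tool": "mpstat", "collectors": ["mpstat", "top"]},
    },
    "feedback": "log",
}


def apply_adjustment(config, adjustment):
    """Return a new configuration with the adjustment merged over config.

    Nested dictionaries are merged key by key; any other value is replaced.
    The input configuration is left unchanged.
    """
    updated = copy.deepcopy(config)
    for key, value in adjustment.items():
        if isinstance(value, dict) and isinstance(updated.get(key), dict):
            updated[key] = apply_adjustment(updated[key], value)
        else:
            updated[key] = copy.deepcopy(value)
    return updated


adjusted = apply_adjustment(
    DEFAULT_CONFIG,
    {"feedback": "email", "key_indicators": {"cpu_usage": {"basic_tool": "top"}}},
)
```

Settings the user does not mention, such as the collectors for CPU usage here, keep their default values, which mirrors the default-then-override behaviour described above.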
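In the embodiments of the present invention, a tool set script that integrates multiple monitoring tools is loaded and run, and multiple preset key indicators are periodically monitored through multiple preset basic tools included in the tool set script; when a target preset key indicator becomes abnormal, device operating parameters are collected through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types. In this way, the operating state of the device is monitored automatically and uniformly through the monitoring tools in the tool set script. When the device fails, the fault type can be fed back relatively quickly and accurately based on the execution logic of the tool set script, without much user involvement and with low consumption of device processing resources.
Based on the same technical concept, an embodiment of the present invention further provides an apparatus for monitoring device failure. As shown in FIG. 4, the apparatus includes:
a monitoring module 401, configured to load and run a tool set script that integrates multiple monitoring tools, and to periodically monitor multiple preset key indicators through multiple preset basic tools included in the tool set script;
a collection module 402, configured to collect, when a target preset key indicator becomes abnormal, device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator;
a determining module 403, configured to determine and feed back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
Optionally, the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
Optionally, the collection module 402 is specifically configured to:
when at least one target preset key indicator becomes abnormal, configure, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script;
remove duplicate data collection threads from all the data collection threads;
configure a guard thread for all the data collection threads;
execute all the data collection threads to collect the device operating parameters.
Optionally, the collection module 402 is specifically configured to:
divide all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool;
execute all the synchronous collection threads simultaneously with multiple threads, and store the collected device operating parameters in a multi-thread storage queue with a read-write lock;
after the synchronous collection threads finish, execute the asynchronous collection threads sequentially.
Optionally, the determining module 403 is specifically configured to:
if the device operating parameters are consistent with the states of the multiple preset key indicators, determine and feed back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
Optionally, the determining module 403 is specifically configured to:
determine, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features;
sort out the device operating parameters under the parameter types, and judge whether the sorted device operating parameters conform to the parameter features;
if they conform, determine and feed back the current fault type; otherwise, verify the next fault type.
Optionally, as shown in FIG. 5, the apparatus further includes:
a receiving module 404, configured to receive a configuration adjustment instruction input by the user for the tool set script;
an update module 405, configured to update the script running configuration of the tool set script according to the configuration adjustment instruction, where the script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.
In the embodiments of the present invention, a tool set script that integrates multiple monitoring tools is loaded and run, and multiple preset key indicators are periodically monitored through multiple preset basic tools included in the tool set script; when a target preset key indicator becomes abnormal, device operating parameters are collected through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator; and the fault type to which the device operating parameters belong is determined and fed back according to the parameter features corresponding to the preset fault types. In this way, the operating state of the device is monitored automatically and uniformly through the monitoring tools in the tool set script. When the device fails, the fault type can be fed back relatively quickly and accurately based on the execution logic of the tool set script, without much user involvement and with low consumption of device processing resources.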
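It should be noted that the apparatus for monitoring device failure provided by the above embodiment is illustrated only with the above division of functional modules. In actual applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus for monitoring device failure provided by the above embodiment belongs to the same concept as the method embodiment for monitoring device failure; its specific implementation process is described in the method embodiment and is not repeated here.
FIG. 6 is a schematic structural diagram of a device according to an embodiment of the present invention. The device 600 may vary considerably depending on configuration or performance, and may include one or more central processors 622 (e.g., one or more processors), memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 662 or data 666. The memory 632 and the storage medium 630 may be transient storage or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the device. Further, the central processor 622 may be configured to communicate with the storage medium 630 and to execute, on the device 600, the series of instruction operations in the storage medium 630.
The device 600 may also include one or more power sources 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, one or more keyboards 656, and/or one or more operating systems 661, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The device 600 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the above monitoring of device failure.
A person of ordinary skill in the art may understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.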

Claims (16)

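  1. A method for monitoring device failure, wherein the method comprises:
    loading and running a tool set script that integrates multiple monitoring tools, and periodically monitoring multiple preset key indicators through multiple preset basic tools included in the tool set script;
    when a target preset key indicator becomes abnormal, collecting device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator;
    determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
  2. The method according to claim 1, wherein the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
  3. The method according to claim 1, wherein the collecting, when a target preset key indicator becomes abnormal, device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator comprises:
    when at least one target preset key indicator becomes abnormal, configuring, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script;
    removing duplicate data collection threads from all the data collection threads;
    configuring a guard thread for all the data collection threads;
    executing all the data collection threads to collect the device operating parameters.
  4. The method according to claim 3, wherein the executing all the data collection threads to collect the device operating parameters comprises:
    dividing all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool;
    executing all the synchronous collection threads simultaneously with multiple threads, and storing the collected device operating parameters in a multi-thread storage queue with a read-write lock;
    after the synchronous collection threads finish, executing the asynchronous collection threads sequentially.
  5. The method according to claim 1, wherein the determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong comprises:
    if the device operating parameters are consistent with the states of the multiple preset key indicators, determining and feeding back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
  6. The method according to claim 1, wherein the determining and feeding back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong comprises:
    determining, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features;
    sorting out the device operating parameters under the parameter types, and judging whether the sorted device operating parameters conform to the parameter features;
    if they conform, determining and feeding back the current fault type; otherwise, verifying the next fault type.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises:
    receiving a configuration adjustment instruction input by the user for the tool set script;
    updating the script running configuration of the tool set script according to the configuration adjustment instruction, wherein the script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.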
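  8. An apparatus for monitoring device failure, wherein the apparatus comprises:
    a monitoring module, configured to load and run a tool set script that integrates multiple monitoring tools, and to periodically monitor multiple preset key indicators through multiple preset basic tools included in the tool set script;
    a collection module, configured to collect, when a target preset key indicator becomes abnormal, device operating parameters through multiple data collection tools that are included in the tool set script and correspond to the target preset key indicator;
    a determining module, configured to determine and feed back, according to parameter features corresponding to preset fault types, the fault type to which the device operating parameters belong.
  9. The apparatus according to claim 8, wherein the multiple preset key indicators include at least one or more of CPU usage, memory usage, load value, I/O wait time, and per-process CPU usage.
  10. The apparatus according to claim 8, wherein the collection module is specifically configured to:
    when at least one target preset key indicator becomes abnormal, configure, for each target preset key indicator, data collection threads for the corresponding multiple data collection tools included in the tool set script;
    remove duplicate data collection threads from all the data collection threads;
    configure a guard thread for all the data collection threads;
    execute all the data collection threads to collect the device operating parameters.
  11. The apparatus according to claim 10, wherein the collection module is specifically configured to:
    divide all the data collection threads into synchronous collection threads and asynchronous collection threads according to the synchronization requirement of each data collection tool;
    execute all the synchronous collection threads simultaneously with multiple threads, and store the collected device operating parameters in a multi-thread storage queue with a read-write lock;
    after the synchronous collection threads finish, execute the asynchronous collection threads sequentially.
  12. The apparatus according to claim 8, wherein the determining module is specifically configured to:
    if the device operating parameters are consistent with the states of the multiple preset key indicators, determine and feed back, according to the parameter features corresponding to the preset fault types, the fault type to which the device operating parameters belong.
  13. The apparatus according to claim 8, wherein the determining module is specifically configured to:
    determine, one by one, the parameter types required by each fault type in a pre-stored fault type library and the corresponding parameter features;
    sort out the device operating parameters under the parameter types, and judge whether the sorted device operating parameters conform to the parameter features;
    if they conform, determine and feed back the current fault type; otherwise, verify the next fault type.
  14. The apparatus according to any one of claims 8 to 13, wherein the apparatus further comprises:
    a receiving module, configured to receive a configuration adjustment instruction input by the user for the tool set script;
    an update module, configured to update the script running configuration of the tool set script according to the configuration adjustment instruction, wherein the script running configuration includes at least one or more of the following: the types of monitoring tools and their operating parameters, the preset key indicators and their corresponding preset basic tools and data collection tools, and the parameter features and feedback methods corresponding to the fault types.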
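  15. A device, wherein the device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for monitoring device failure according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for monitoring device failure according to any one of claims 1 to 7.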
PCT/CN2018/091208 2018-05-08 2018-06-14 一种监控设备故障的方法和装置 WO2019214010A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/463,488 US20210109800A1 (en) 2018-05-08 2018-06-14 Method and apparatus for monitoring device failure
EP18901807.0A EP3591485B1 (en) 2018-05-08 2018-06-14 Method and device for monitoring for equipment failure

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810433735.1A CN108508874B (zh) 2018-05-08 2018-05-08 一种监控设备故障的方法和装置
CN201810433735.1 2018-05-08

Publications (1)

Publication Number Publication Date
WO2019214010A1 true WO2019214010A1 (zh) 2019-11-14

Family

ID=63399871

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/091208 WO2019214010A1 (zh) 2018-05-08 2018-06-14 一种监控设备故障的方法和装置

Country Status (4)

Country Link
US (1) US20210109800A1 (zh)
EP (1) EP3591485B1 (zh)
CN (1) CN108508874B (zh)
WO (1) WO2019214010A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035466A (zh) * 2021-11-05 2022-02-11 肇庆高峰机械科技有限公司 一种双工位磁片排列机的控制系统

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634803A (zh) * 2018-11-16 2019-04-16 网宿科技股份有限公司 一种上报设备异常的方法和装置
CN110620806B (zh) * 2019-03-18 2022-07-22 北京无限光场科技有限公司 信息生成方法和装置
CN109992600B (zh) * 2019-03-28 2021-09-07 佛山市百斯特电器科技有限公司 一种设备故障的响应方法及设备
CN110740061B (zh) * 2019-10-18 2020-09-29 北京三快在线科技有限公司 故障预警方法、装置及计算机存储介质
EP3817004A1 (en) * 2019-11-01 2021-05-05 Koninklijke Philips N.V. System and method for classifying and managing medical application disconnects
CN113645624A (zh) * 2021-08-25 2021-11-12 广东省高峰科技有限公司 一种异常网络数据的排查方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1733506B1 (en) * 2004-03-18 2012-08-15 ADVA AG Optical Networking Fault management in an ethernet based communication system
CN104378246A (zh) * 2014-12-09 2015-02-25 福建星网锐捷网络有限公司 一种网络设备故障定位系统、方法及装置
US9026646B2 (en) * 2011-09-16 2015-05-05 Tripwire, Inc. Methods and apparatus for remediating policy test failures, including correlating changes to remediation processes
CN105306272A (zh) * 2015-11-10 2016-02-03 中国建设银行股份有限公司 信息系统故障场景信息收集方法及系统
CN105323111A (zh) * 2015-11-17 2016-02-10 南京南瑞集团公司 一种运维自动化系统及方法
CN107104838A (zh) * 2017-05-15 2017-08-29 北京奇艺世纪科技有限公司 一种信息处理方法、服务器及终端
CN107846314A (zh) * 2017-10-31 2018-03-27 广西宜州市联森网络科技有限公司 一种智能运维管理系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9684554B2 (en) * 2007-03-27 2017-06-20 Teradata Us, Inc. System and method for using failure casting to manage failures in a computed system
KR101405917B1 (ko) * 2008-10-06 2014-06-12 삼성전자주식회사 화상형성장치에서 웹 메일에 파일을 첨부하여 전송하는 방법 및 이를 수행하는 화상형성장치
CN102546266B (zh) * 2012-03-09 2015-01-28 中兴通讯股份有限公司 一种网络故障的诊断方法及平台
CN103178615B (zh) * 2013-02-05 2016-09-14 广东电网公司 电力设备故障监控方法及其系统
CN104852810B (zh) * 2014-02-18 2018-11-30 中国移动通信集团公司 一种业务平台异常的确定方法和设备
CN104038392A (zh) * 2014-07-04 2014-09-10 云南电网公司 一种云计算资源服务质量评估方法
CN105320585B (zh) * 2014-07-08 2019-04-02 北京启明星辰信息安全技术有限公司 一种实现应用故障诊断的方法及装置
CN104410535B (zh) * 2014-12-23 2018-03-30 浪潮电子信息产业股份有限公司 一种云资源智能监控告警方法
CN106776248A (zh) * 2016-11-11 2017-05-31 乐视控股(北京)有限公司 一种数据处理的方法和装置
CN107168846A (zh) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 电子设备的监控方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3591485A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035466A (zh) * 2021-11-05 2022-02-11 肇庆高峰机械科技有限公司 一种双工位磁片排列机的控制系统
CN114035466B (zh) * 2021-11-05 2022-05-31 肇庆高峰机械科技有限公司 一种双工位磁片排列机的控制系统

Also Published As

Publication number Publication date
CN108508874B (zh) 2019-12-31
EP3591485A4 (en) 2020-04-29
US20210109800A1 (en) 2021-04-15
EP3591485B1 (en) 2021-08-25
EP3591485A1 (en) 2020-01-08
CN108508874A (zh) 2018-09-07

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018901807

Country of ref document: EP

Effective date: 20190730

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18901807

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE