CN115729783A - Fault risk monitoring method, apparatus, storage medium and program product - Google Patents

Fault risk monitoring method, apparatus, storage medium and program product

Publication number
CN115729783A
Authority
CN
China
Prior art keywords
historical
fault
current
matching
states
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211520954.6A
Other languages
Chinese (zh)
Inventor
袁野
胡大奎
张翼
高进
李晨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Insurance Company of China
Original Assignee
Peoples Insurance Company of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peoples Insurance Company of China
Priority to CN202211520954.6A
Publication of CN115729783A
Legal status: Pending

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a fault risk monitoring method, device, storage medium and program product. The method comprises: acquiring the current operating state of a system, the current operating state comprising current values of a plurality of operating indexes; matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees respectively corresponding to the plurality of historical operating states, wherein different historical operating states correspond to different sampling times and each historical operating state comprises historical values of the plurality of operating indexes at the corresponding sampling time; and acquiring the fault event in the historical operating state corresponding to the maximum value among the matching degrees, and generating a fault risk prompt according to the fault event. The method provided by the embodiments can improve the comprehensiveness and accuracy of monitoring.

Description

Fault risk monitoring method, apparatus, storage medium and program product
Technical Field
The embodiment of the application relates to the technical field of software testing, in particular to a fault risk monitoring method, equipment, a storage medium and a program product.
Background
In order to ensure the normal operation of software, production operation and maintenance personnel usually monitor the operation conditions of a server, a database, a network and a client deployed by an application system, and track and locate problems according to monitored index data.
In the related art, early warning is usually performed when one or more indexes exceed a threshold value.
However, in the process of implementing the present application, the inventors found that at least the following problems exist in the prior art: the existing approach can only issue early warnings for the indexes that are monitored, and often no fault early warning is issued even though a fault risk exists, so both the comprehensiveness and the accuracy of monitoring are low.
Disclosure of Invention
The embodiment of the application provides a fault risk monitoring method, equipment, a storage medium and a program product, so that the comprehensiveness and the accuracy of monitoring are improved.
In a first aspect, an embodiment of the present application provides a fault risk monitoring method, including:
acquiring the current running state of the system; the current operating state comprises current values of a plurality of operating indicators;
matching the current running state with a plurality of historical running states respectively to obtain matching degrees corresponding to the plurality of historical running states respectively; different historical operating states correspond to different sampling times; the historical operating state comprises historical values of a plurality of operating indexes at corresponding sampling time;
and acquiring a fault event in the historical operating state corresponding to the maximum value in the matching degrees, and generating a fault risk prompt according to the fault event.
In one possible design, the matching the current operating state with a plurality of historical operating states respectively includes:
and if the current values do not exceed the corresponding preset threshold value ranges, respectively matching the current running state with the historical running states.
In a possible design, after obtaining the current operating state of the system, the method further includes:
if at least one current value in the current values exceeds a corresponding preset threshold range, generating a corresponding fault early warning;
screening a plurality of historical operating states to obtain a plurality of operating states to be matched; the fault event corresponding to the running state to be matched comprises a fault event corresponding to the fault early warning;
the matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees corresponding to the plurality of historical operating states respectively comprises:
and respectively matching the current running state with the running states to be matched to obtain matching degrees respectively corresponding to the running states to be matched.
In a possible design, the matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees corresponding to the plurality of historical operating states respectively includes:
acquiring a first vector of the current running state and a plurality of second vectors of the historical running states;
and aiming at a second vector of each historical operating state in a plurality of historical operating states, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree corresponding to the historical operating state according to the Mahalanobis distance.
In a possible design, before the matching the current operating state with the plurality of historical operating states, the method further includes:
dividing a preset acquisition period into a plurality of time intervals;
and aiming at each time interval, acquiring historical operating states according to the sampling frequency corresponding to the time interval, and performing associated storage on the acquired historical operating states and fault events occurring at corresponding acquisition moments.
In a possible design, before the collecting the historical operating state according to the sampling frequency corresponding to the time interval, the method further includes:
determining the demand of the fault event according to the confidence coefficient demand and the sampling error demand;
collecting a plurality of fault events according to the demand quantity;
dividing the plurality of fault events into a plurality of different time intervals;
acquiring the number of fault events occurring in each time interval;
and determining sampling frequencies corresponding to the time intervals according to the number of the fault events corresponding to the time intervals.
In one possible design, the determining, according to the number of the fault events respectively corresponding to the plurality of time intervals, the sampling frequency respectively corresponding to the plurality of time intervals includes:
determining the acquisition times corresponding to the preset acquisition period;
and calculating the ratio of the number of the fault events corresponding to each time interval to the total number of the fault events, and determining the sampling frequency corresponding to each time interval according to the ratio and the collection times.
In a possible design, before the collecting, for each time interval, the historical operating state according to the sampling frequency corresponding to the time interval, the method further includes:
determining a target duration according to the software iteration frequency;
the collecting of the historical operating state according to the sampling frequency corresponding to each time interval comprises the following steps:
aiming at each time interval in a plurality of preset acquisition periods corresponding to the target duration, acquiring historical operating states according to sampling frequency corresponding to the time interval, and adding the acquired historical operating states into a sample set;
the respectively matching the current operating state with the plurality of historical operating states includes:
and respectively matching the current running state with a plurality of historical running states in the sample set.
In a second aspect, an embodiment of the present application provides a fault risk monitoring device, including:
the acquisition module is used for acquiring the current running state of the system; the current operating state comprises current values of a plurality of operating indicators;
the matching module is used for respectively matching the current running state with a plurality of historical running states to obtain matching degrees respectively corresponding to the plurality of historical running states; different historical operating states correspond to different sampling times; the historical operating state comprises historical values of a plurality of operating indexes at corresponding sampling time;
and the generating module is used for acquiring a fault event in a historical operating state corresponding to the maximum value in the matching degrees and generating a fault risk prompt according to the fault event.
In a third aspect, an embodiment of the present application provides a fault risk monitoring device, including: at least one processor and memory;
the memory stores computer-executable instructions;
execution of the computer-executable instructions stored by the memory by the at least one processor causes the at least one processor to perform the method as set forth above in the first aspect and in various possible designs of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the method according to the first aspect and the various possible designs of the first aspect is implemented.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program that, when executed by a processor, implements the method as set forth in the first aspect and various possible designs of the first aspect.
The fault risk monitoring method, device, storage medium and program product provided by the embodiments include: acquiring the current operating state of a system, where the current operating state includes current values of a plurality of operating indexes; matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees respectively corresponding to the plurality of historical operating states, where different historical operating states correspond to different sampling times and each historical operating state includes historical values of the plurality of operating indexes at the corresponding sampling time; and acquiring the fault event in the historical operating state corresponding to the maximum value among the matching degrees, and generating a fault risk prompt according to the fault event. In the fault risk monitoring method provided by the embodiments, the current values of the plurality of operating indexes are obtained and matched against historical index values collected in advance, and the fault risk prompt is generated from the fault event that occurred in the historical operating state with the maximum matching degree, so the comprehensiveness and accuracy of monitoring can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of a fault risk monitoring method according to an embodiment of the present application;
Fig. 2 is a first schematic flow chart of a fault risk monitoring method according to an embodiment of the present application;
Fig. 3 is a second schematic flow chart of a fault risk monitoring method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of samples for an application server cluster according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a fault risk monitoring device according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a hardware structure of the fault risk monitoring device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to ensure the normal operation of software, production operation and maintenance personnel usually monitor the operation conditions of a server, a database, a network and a client deployed by an application system, realize failure risk early warning by setting performance monitoring indexes and index thresholds, and trace and locate problems according to monitored index data.
A related performance risk early warning system usually performs early warning on a condition that one or more indexes exceed a threshold, but two disadvantages exist in an actual application scenario: firstly, the existing early warning system can only carry out early warning on monitored indexes, however, some performance indexes cannot be comprehensively monitored due to high monitoring difficulty, and potential risks caused by monitoring flaws cannot be covered by the traditional threshold early warning system, so that monitoring is not comprehensive; and secondly, when all monitored index values are lower than the threshold value, the threshold early warning system considers that the application is normally operated and cannot send out early warning, but the possibility of failure still exists actually, and the monitoring accuracy is low.
In order to solve this technical problem, the inventors of the present application found that the operation index values of the application system over a past period of time, together with data on whether faults occurred, can be collected as samples. The matching degree between the current operation index values of the application system and each sample is then calculated, and the sample with the maximum matching degree is found; this sample can be regarded as the one closest to the current operating state of the application system. If the sample contains a fault event, the system can be judged to have a fault risk, and the solution and details of the sample's fault event can be provided to the relevant personnel for reference. Based on this, the embodiments of the present application provide a fault risk monitoring method that can improve the comprehensiveness and accuracy of monitoring.
Fig. 1 is a schematic application scenario diagram of a fault risk monitoring method provided in an embodiment of the present application. As shown in fig. 1, a server 101 is communicatively connected to a monitoring device 102.
In a specific implementation process, the server 101 collects the current operating state and transmits it to the monitoring device 102. The monitoring device 102 obtains the current operating state, which comprises current values of a plurality of operating indexes; matches the current operating state with a plurality of historical operating states respectively to obtain matching degrees respectively corresponding to the plurality of historical operating states, where different historical operating states correspond to different sampling times and each historical operating state comprises historical values of the plurality of operating indexes at the corresponding sampling time; and acquires the fault event in the historical operating state corresponding to the maximum value among the matching degrees and generates a fault risk prompt according to the fault event. In the fault risk monitoring method provided by the embodiment of the application, the current values of the plurality of operating indexes are obtained and matched against historical index values collected in advance, and the fault risk prompt is generated from the fault event that occurred in the historical operating state with the maximum matching degree, so the comprehensiveness and accuracy of monitoring can be improved.
The monitoring device 102 may be a terminal device or a server.
The server 101 may be a single server or a plurality of servers, and when the server is a plurality of servers, whether the plurality of servers are to be monitored respectively may be determined based on the similarity and difference of the operation indexes of the servers.
For example, the server 101 may be an application server, and the current operating state of the application server may include a plurality of operating indexes, for example: service resource class: CPU utilization, disk space utilization, disk index node (inode) utilization, memory space utilization, and number of zombie processes; network class: bandwidth utilization and TCP connection state; application programming interface (API) class: API accuracy and API response time; traffic class: number of current online users, and so on. The server 101 may also be a database server, and the current operating state of the database server may include a plurality of operating indexes, for example: service resource class: CPU utilization, disk space utilization, disk index node (inode) utilization, memory space utilization, and number of zombie processes; network class: bandwidth utilization and TCP connection state; database class: slow SQL execution time, table space utilization, ratio of the number of active connections to the maximum number of connections, number of deadlock errors (ORA-60) in the alert log, and so on.
If the server 101 includes both an application server and a database server, the two can be handled separately during sampling and risk assessment, because the index items to be collected differ between them and the corresponding fault events can likewise be clearly attributed to one or the other. It should be noted that, to make problems easier to capture, a fault event may be classified by where the error is reported, that is, by the symptom of the problem rather than its root cause: as a problem found on the application server side or a problem found on the database server side.
In addition, the occurrence of a fault event is also divided into two cases: one is that a certain operation index of the system exceeds a set threshold value, namely the index shows abnormal performance, the early warning system automatically sends out fault early warning, all the events correspond to definite abnormal index items, and the problem of the application server side or the problem of the database server side can be directly judged according to the attribution of the index items; the second is a fault event reported by operation and maintenance personnel or users under the condition that each index is normal, and the classification of the event needs to be determined through manual judgment.
It should be noted that the scenario diagram shown in fig. 1 is only an example. The fault risk monitoring method and the scenario described in the embodiment of the present application are intended to illustrate the technical solution of the embodiment more clearly and do not limit the technical solution provided in the embodiment. A person having ordinary skill in the art will appreciate that, as systems evolve and new service scenarios emerge, the technical solution provided in the embodiment of the present application is equally applicable to similar technical problems.
The technical solution of the present application will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a first flowchart of a fault risk monitoring method according to an embodiment of the present application. As shown in fig. 2, the method includes:
201. Acquiring the current running state of the system; the current operating state includes current values of a plurality of operating metrics.
The execution subject of this embodiment may be a terminal device or a server, such as the monitoring device 102 shown in fig. 1.
The system in this embodiment may be a system installed and operated by the server 101 shown in fig. 1.
202. Matching the current running state with a plurality of historical running states respectively to obtain matching degrees corresponding to the plurality of historical running states respectively; different historical operating states correspond to different sampling times; the historical operating state includes historical values of the plurality of operating indicators at corresponding sampling times.
Specifically, current operation indexes of the application system can be traversed, when a certain operation index exceeds a corresponding threshold value, early warning is automatically triggered, an early warning event is stored in an event list, and an early warning index item and an early warning index value are prompted on a front-end page. Assuming that m index items exceeding the threshold are detected, the processing can be performed based on the value of m.
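The threshold traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the index names, the `(low, high)` representation of a threshold range, and the function name `exceeded_indicators` are all hypothetical.

```python
from typing import Dict, List, Tuple


def exceeded_indicators(
    current: Dict[str, float],
    thresholds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return the names of indexes whose current value lies outside its preset range."""
    exceeded = []
    for name, value in current.items():
        low, high = thresholds[name]
        if not (low <= value <= high):
            exceeded.append(name)
    return exceeded


# Hypothetical index names and threshold ranges, for illustration only.
thresholds = {
    "cpu_usage": (0.0, 0.85),
    "disk_usage": (0.0, 0.90),
    "zombie_processes": (0, 10),
}
current = {"cpu_usage": 0.91, "disk_usage": 0.40, "zombie_processes": 2}

alerts = exceeded_indicators(current, thresholds)
m = len(alerts)  # number of index items exceeding their threshold range
```

The value of `m` then selects the matching strategy: full matching when `m` is 0, or matching restricted to samples with recorded fault events otherwise.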
In this embodiment, there are multiple matching manners between the current operating state and the historical operating state, and in some embodiments, the matching the current operating state with multiple historical operating states to obtain matching degrees corresponding to the multiple historical operating states respectively may include: acquiring a first vector of the current running state and a plurality of second vectors of the historical running states; and aiming at a second vector of each historical operating state in a plurality of historical operating states, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree corresponding to the historical operating state according to the Mahalanobis distance.
Specifically, the formula for calculating the mahalanobis distance may be as shown in formula (1):
$$D_M(x, y) = \sqrt{(x - y)^{T} \, \Sigma^{-1} \, (x - y)} \quad (1)$$
where x and y denote the vector of operation index values of the current system and the vector of operation index values of a given sample, respectively, and $D_M(x, y)$ is the Mahalanobis distance between the two. $\Sigma$ is the covariance matrix of the sample set S, and $\Sigma^{-1}$ is the inverse of the covariance matrix.
The covariance matrix can be calculated by referring to the following formula (2):
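$$\Sigma = \begin{pmatrix} \mathrm{cov}(c_1, c_1) & \cdots & \mathrm{cov}(c_1, c_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(c_n, c_1) & \cdots & \mathrm{cov}(c_n, c_n) \end{pmatrix} \quad (2)$$
where m is the sample size, n is the number of operation indexes, $c_i$ denotes the i-th index item, $E(c_i)$ denotes the average value of $c_i$, and $\mathrm{cov}(c_i, c_j) = E([c_i - E(c_i)][c_j - E(c_j)])$ denotes the covariance between the i-th index and the j-th index.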
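As a rough illustration of the Mahalanobis distance and covariance matrix described above, the following sketch uses only the Python standard library. Several choices here are assumptions rather than details from the application: the covariance divides by the sample size m (following the expectation form of cov), the inverse is applied by solving a linear system instead of forming the inverse matrix explicitly, and the two-index sample set is invented for the example.

```python
import math
from typing import List


def covariance_matrix(samples: List[List[float]]) -> List[List[float]]:
    """cov(ci, cj) = E([ci - E(ci)][cj - E(cj)]) over row-wise samples (divides by m)."""
    m, n = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / m for j in range(n)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / m
         for j in range(n)]
        for i in range(n)
    ]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis(x: List[float], y: List[float], cov: List[List[float]]) -> float:
    """sqrt((x - y)^T cov^{-1} (x - y)), computed via a linear solve."""
    d = [xi - yi for xi, yi in zip(x, y)]
    z = solve(cov, d)  # z = cov^{-1} (x - y)
    return math.sqrt(max(0.0, sum(di * zi for di, zi in zip(d, z))))


# Hypothetical sample set S: each row is (CPU usage, memory usage) at one sampling time.
S = [[0.20, 0.30], [0.40, 0.35], [0.60, 0.70], [0.80, 0.65]]
cov = covariance_matrix(S)
distance = mahalanobis([0.55, 0.60], S[2], cov)
```

A smaller distance corresponds to a larger matching degree. With highly correlated or constant indexes the covariance matrix may be singular, in which case this sketch raises an error rather than guessing a remedy.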
In some embodiments, when the threshold early warning has not been triggered, all operation indexes may be fully matched to keep monitoring comprehensive. Specifically, the respectively matching the current operation state with the multiple historical operation states may include: if none of the current values exceeds its corresponding preset threshold range, that is, m = 0, matching the current running state with the plurality of historical running states respectively. For example, the Mahalanobis distances between the current operation index values of the application system and all samples in the sample set S may be calculated, and the ID of the sample with the smallest distance to (that is, the largest matching degree with) the current operation index values may be recorded.
In some embodiments, for the case that the threshold pre-warning is triggered, in order to save the computing resources, the matching process may be performed only for the operation index whose current value exceeds the threshold range. Specifically, after acquiring the current operating state of the system, the method may further include: if at least one current value in the current values exceeds a corresponding preset threshold range, generating a corresponding fault early warning; screening a plurality of historical operating states to obtain a plurality of operating states to be matched; the fault event corresponding to the running state to be matched comprises a fault event corresponding to the fault early warning; the matching the current operating state with the plurality of historical operating states respectively to obtain matching degrees corresponding to the plurality of historical operating states respectively may include: and respectively matching the current running state with the running states to be matched to obtain the matching degrees respectively corresponding to the running states to be matched.
For example, when a threshold early warning has been triggered, that is, an event has occurred, it is only necessary to calculate the Mahalanobis distances between the current operation index values of the system and all samples in the sample set S whose "event flag" is "Y", and to record the ID of the sample with the smallest distance to the current operation index values.
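The two matching cases (full matching when no threshold is exceeded, and matching restricted to samples flagged "Y" once an early warning has been triggered) can be sketched as below. The `Sample` class, its field names and the function `closest_sample` are hypothetical, and the default distance is Euclidean (`math.dist`) purely as a stand-in so the block runs on its own; the text uses the Mahalanobis distance.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Sample:
    sample_id: str
    values: Sequence[float]  # historical index values at one sampling time
    event_flag: str          # "Y" if a fault event was recorded, otherwise "N"


def closest_sample(
    current: Sequence[float],
    samples: List[Sample],
    alarm_triggered: bool,
    distance: Callable[[Sequence[float], Sequence[float]], float] = math.dist,
) -> Optional[Sample]:
    """Return the sample with the smallest distance (largest matching degree)."""
    if alarm_triggered:
        # An early warning was raised: only compare against samples with a fault event.
        candidates = [s for s in samples if s.event_flag == "Y"]
    else:
        # No threshold exceeded (m = 0): compare against the whole sample set.
        candidates = list(samples)
    if not candidates:
        return None
    return min(candidates, key=lambda s: distance(current, s.values))


# Hypothetical sample set, for illustration only.
samples = [
    Sample("s1", [0.20, 0.30], "N"),
    Sample("s2", [0.90, 0.80], "Y"),
    Sample("s3", [0.50, 0.50], "Y"),
]
full_match = closest_sample([0.25, 0.30], samples, alarm_triggered=False)
filtered_match = closest_sample([0.25, 0.30], samples, alarm_triggered=True)
```

Passing the distance as a parameter keeps the screening logic independent of the metric, so a Mahalanobis implementation can be supplied without changing the selection code.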
203. And acquiring a fault event under a historical operating state corresponding to the maximum value in the matching degrees, and generating a fault risk prompt according to the fault event.
Exemplarily, after the ID of the minimum-distance sample, which corresponds to the maximum matching degree, is obtained, whether the application system currently has a fault risk is judged from the "event flag" of that sample: if the "event flag" is "Y", the front-end page displays a risk warning and presents the details of the matched sample together with the resolution and analysis details of the corresponding event; if the "event flag" is "N", no risk exists and the front-end page shows no risk prompt.
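The decision on the matched sample's "event flag" might look like the following sketch. The dictionary layout, the key names and the function `fault_risk_prompt` are assumptions made for illustration; the text does not specify how the prompt is represented.

```python
from typing import Optional


def fault_risk_prompt(matched_sample: dict) -> Optional[dict]:
    """Build a risk prompt from the best-matching sample, or None if it has no fault event."""
    if matched_sample["event_flag"] != "Y":
        return None  # no risk: the front-end page shows no prompt
    return {
        "message": "risk warning",
        "matched_sample_id": matched_sample["sample_id"],
        "matched_values": matched_sample["values"],
        # Resolution and analysis details recorded for the historical event, if any.
        "event_details": matched_sample.get("event_details"),
    }


# Hypothetical matched samples, for illustration only.
risky = {
    "sample_id": "s3",
    "values": [0.50, 0.50],
    "event_flag": "Y",
    "event_details": "connection pool exhausted; pool size increased",
}
safe = {"sample_id": "s1", "values": [0.20, 0.30], "event_flag": "N"}

prompt = fault_risk_prompt(risky)
no_prompt = fault_risk_prompt(safe)
```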
According to the fault risk monitoring method provided by the embodiment, the current multiple operation indexes are obtained, the multiple operation indexes are matched with the historical indexes collected in advance, and the fault risk prompt is generated based on the fault event occurring in the historical operation state with the maximum matching degree, so that the monitoring comprehensiveness and accuracy can be improved.
Fig. 3 is a second schematic flow chart of the fault risk monitoring method provided in an embodiment of the present application. As shown in fig. 3, building on the above embodiment, for example the embodiment shown in fig. 2, this embodiment describes by way of example the generation and maintenance of the historical operating states, that is, of the risk assessment sample set. The method includes:
301. The preset acquisition period is divided into a plurality of time intervals.
302. And aiming at each time interval, acquiring historical operating states according to the sampling frequency corresponding to the time interval, and performing associated storage on the acquired historical operating states and fault events occurring at corresponding acquisition moments.
Specifically, the operation index values of the application system change over time, and experience shows that faults are generally more likely during the daily service peak period and while batch processing tasks are executing than during other time periods. Therefore, to make the collected samples more representative, the production events that recently occurred in the system need to be pre-surveyed, and a reasonable sampling frequency should be set according to the time distribution of those production events.
In some embodiments, in order to save computing resources, different sampling frequencies are used in different time intervals, and the sampling frequency may be determined by: determining the demand of the fault event according to the confidence coefficient demand and the sampling error demand; collecting a plurality of fault events according to the demand quantity; dividing the plurality of fault events into a plurality of different time intervals; acquiring the number of fault events occurring in each time interval; and determining sampling frequencies corresponding to the time intervals according to the number of the fault events corresponding to the time intervals.
In some embodiments, the determination may be based on parameters such as the period of fluctuation of the software function for more reasonable sampling frequency settings. Specifically, the determining, according to the number of the fault events corresponding to the multiple time intervals, the sampling frequency corresponding to the multiple time intervals may include: determining the acquisition times corresponding to the preset acquisition period; and calculating the ratio of the number of the fault events corresponding to each time interval to the total number of the fault events, and determining the sampling frequency corresponding to each time interval according to the ratio and the collection times.
In some embodiments, the determination of the duration of coverage of the sample set may reference a software iteration frequency to improve rationality. Specifically, the target duration may be determined according to the software iteration frequency; the acquiring, for each time interval, a historical operating state according to a sampling frequency corresponding to the time interval may include: aiming at each time interval in a plurality of preset acquisition periods corresponding to the target duration, acquiring historical operating states according to sampling frequency corresponding to the time interval, and adding the acquired historical operating states into a sample set; the matching the current operating state with the plurality of historical operating states may include: and respectively matching the current running state with a plurality of historical running states in the sample set.
For example, first, the number of samples needed for the preliminary survey may be estimated with the sample-size estimation formula for a categorical (proportion) variable, shown as formula (3):
$$ n = \frac{z^{2}\, p\,(1-p)}{\delta^{2}} \tag{3} $$
where n is the sample capacity, z is obtained by looking up a z-value table for the chosen confidence level, p is the expected proportion in the target population, and δ (delta) is the sampling error range.
For example, with a 95% confidence level (z = 1.96), a sampling error range of 4%, and p(1-p) taken at its maximum of 0.25 (that is, p = 0.5), the resulting sample capacity n is 600.25. In other words, at a 95% confidence level and a 4% sampling error range, the 601 most recent production events of the software need to be collected for investigation and analysis.
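The arithmetic of this example can be checked with a short sketch. It assumes that formula (3) has the standard proportion-based form n = z²·p·(1−p)/δ², which reproduces the 600.25 quoted in the text; the function name `required_sample_size` is hypothetical and not part of the application.

```python
import math


def required_sample_size(z: float, p: float, delta: float) -> float:
    """Assumed form of formula (3): n = z^2 * p * (1 - p) / delta^2."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    return z * z * p * (1.0 - p) / (delta * delta)


# Values from the example: 95% confidence (z = 1.96), p = 0.5, 4% error range.
n = required_sample_size(z=1.96, p=0.5, delta=0.04)

# A fractional sample size is rounded up to the next whole event.
events_needed = math.ceil(n)
print(round(n, 2), events_needed)
```

Choosing p = 0.5 maximizes p(1−p), so the estimate is the most conservative one when the true proportion is unknown.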
Secondly, survey samples of the same kind can be collected according to the estimated sample capacity and assigned to time intervals covering the 24 hours of a day. For example, the day can be divided into 24 intervals, namely [0:00, 1:00), [1:00, 2:00), [2:00, 3:00), …, [23:00, 24:00). The ratio of the number of events in each interval to the total number of events is then calculated and recorded as b1, b2, b3, b4, …, b24.
Thirdly, the sampling frequency in each time interval can be set according to the obtained event-number ratios b1, b2, b3, b4, …, b24. The higher the ratio, the more likely the system is to fail during that interval, so the higher the sampling frequency should be. It is reasonable to assume that the performance indicators of the software are sampled once every n minutes on average, where n here is the sampling interval rather than the sample capacity above and may be, for example, 5: the indicators generally do not fluctuate greatly within n minutes; even if they do, the fluctuation can be captured in the next n-minute period; and an abrupt change that lasts only a short time can be regarded as noise. With n = 5, sampling is performed 288 times a day, and the sampling frequencies of the intervals are b1 × 288, b2 × 288, …, b24 × 288, recorded as f1, f2, f3, …, f24.
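The allocation described above can be sketched as follows. This is a minimal illustration that assumes 24 hourly intervals and 288 samples per day, as in the example; the function name `sampling_plan` and the sample event timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def sampling_plan(event_times, samples_per_day=288):
    """Return (ratios, frequencies) for the 24 hourly intervals.

    ratios[i]      is b_(i+1): share of fault events in [i:00, i+1:00)
    frequencies[i] is f_(i+1): b_(i+1) * samples_per_day, left unrounded
    """
    if not event_times:
        raise ValueError("at least one fault event is required")
    counts = Counter(t.hour for t in event_times)
    total = len(event_times)
    ratios = [counts[h] / total for h in range(24)]  # missing hours count as 0
    frequencies = [b * samples_per_day for b in ratios]
    return ratios, frequencies


# Hypothetical fault events: two at 9 o'clock, one at 14, one at 23.
events = [
    datetime(2022, 11, 1, 9, 12),
    datetime(2022, 11, 3, 9, 40),
    datetime(2022, 11, 5, 14, 5),
    datetime(2022, 11, 9, 23, 59),
]
ratios, frequencies = sampling_plan(events)
print(frequencies[9], frequencies[14], frequencies[23])  # 144.0 72.0 72.0
```

Note that an interval with no recorded fault events receives a frequency of zero under this literal reading. The text does not say how such intervals are handled, so a minimum frequency per interval would be an additional design choice.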
Fourthly, apart from automatic early warnings triggered when a threshold is exceeded, information on other fault events has to be maintained manually or obtained by connecting to the existing operation and maintenance management system. Based on the time of such a non-automatic early-warning event, the risk estimation system may find the sample collected at that time and add the event information to the event list associated with that sample.
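One way to associate a manually recorded event with a sample by time is sketched below. The text does not specify the matching rule, so this sketch assumes that the event is attached to the latest sample taken at or before the event time; the function name `attach_event` and the data layout are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def attach_event(sample_times, event_lists, event_time, event_info):
    """Append event_info to the sample taken at or just before event_time.

    sample_times: sorted list of datetimes, one per sample
    event_lists:  parallel list of lists holding each sample's events
    Returns the index of the sample that received the event.
    """
    idx = bisect_right(sample_times, event_time) - 1
    if idx < 0:
        raise ValueError("event precedes the earliest sample")
    event_lists[idx].append(event_info)
    return idx


# Three samples taken five minutes apart, none with events yet.
sample_times = [
    datetime(2022, 11, 30, 10, 0),
    datetime(2022, 11, 30, 10, 5),
    datetime(2022, 11, 30, 10, 10),
]
event_lists = [[], [], []]

# A manually maintained event reported at 10:07 lands on the 10:05 sample.
matched = attach_event(sample_times, event_lists,
                       datetime(2022, 11, 30, 10, 7), "slow response reported")
print(matched, event_lists)
```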
Finally, the time period covered by the samples can be determined by jointly considering the software iteration frequency and the failure occurrence frequency. For example, samples of the system's operation over the last 15 days and samples of fault events over the last 180 days are collected. The reason for this setting is that the system runs normally most of the time and each server collects data every 5 minutes, so 15 days of samples are enough to represent the normal operating state of the system (one server yields 4320 samples in 15 days). Fault events, in contrast, are sporadic, so the sampling period needs to be extended to obtain enough samples. After each new day of sample collection, the risk assessment system automatically removes expired samples, completing the automatic update of the sample set S.
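The daily update of the sample set S can be illustrated with a small sketch. It assumes the 15-day and 180-day windows from the example and a simple dictionary layout per sample; the function name `prune_sample_set` and the field names are hypothetical.

```python
from datetime import datetime, timedelta

NORMAL_RETENTION = timedelta(days=15)   # window for ordinary samples (from the example)
FAULT_RETENTION = timedelta(days=180)   # window for samples with fault events (from the example)


def prune_sample_set(samples, now):
    """Drop expired samples: fault samples live 180 days, others 15 days.

    Each sample is a dict with 'id', 'time' (datetime) and 'events' (list).
    """
    kept = []
    for s in samples:
        limit = FAULT_RETENTION if s["events"] else NORMAL_RETENTION
        if now - s["time"] <= limit:
            kept.append(s)
    return kept


now = datetime(2022, 11, 30)
samples = [
    {"id": "a", "time": now - timedelta(days=10), "events": []},
    {"id": "b", "time": now - timedelta(days=20), "events": []},
    {"id": "c", "time": now - timedelta(days=100), "events": ["timeout"]},
    {"id": "d", "time": now - timedelta(days=200), "events": ["timeout"]},
]
kept_ids = [s["id"] for s in prune_sample_set(samples, now)]
print(kept_ids)  # ['a', 'c']
```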
303. Acquiring the current running state of the system; the current operating state includes current values of a plurality of operating metrics.
304. Matching the current running state with a plurality of historical running states respectively to obtain matching degrees corresponding to the plurality of historical running states respectively; different historical operating states correspond to different sampling times; the historical operating state includes historical values of the plurality of operating indicators at corresponding sampling times.
305. Acquiring a fault event in the historical operating state corresponding to the maximum value among the matching degrees, and generating a fault risk prompt according to the fault event.
Steps 303 to 305 in this embodiment are similar to steps 201 to 203 in the above embodiment, and are not described again here.
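Steps 303 to 305 can be sketched in code under several stated assumptions. The claims name the Mahalanobis distance as one way to compare state vectors; this sketch additionally assumes that the covariance matrix is estimated from the historical vectors themselves and that the matching degree is 1/(1 + distance), neither of which is taken from the text. All function names and the sample data are hypothetical.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)), computed without inverting cov."""
    diff = [a - b for a, b in zip(u, v)]
    y = solve(cov, diff)
    return math.sqrt(max(0.0, sum(d * t for d, t in zip(diff, y))))


def best_match(current, history):
    """history: list of (vector, fault_events). Returns (degree, fault_events)."""
    cov = covariance([vec for vec, _ in history])
    scored = [(1.0 / (1.0 + mahalanobis(current, vec, cov)), events)
              for vec, events in history]
    return max(scored, key=lambda item: item[0])


# Hypothetical history of (CPU %, memory %) states and their fault events.
history = [
    ([20.0, 30.0], []),
    ([25.0, 35.0], []),
    ([90.0, 40.0], ["CPU saturation"]),
    ([30.0, 85.0], ["memory leak"]),
]
degree, events = best_match([88.0, 42.0], history)
if events:
    print("Fault risk prompt:", ", ".join(events))
```

Any monotonically decreasing function of the distance would serve as a matching degree here, since step 305 only needs the maximum.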
According to the fault risk monitoring method provided by this embodiment, using different sampling frequencies in different time intervals makes the collected samples more representative. As a result, fewer samples need to be collected, sample collection is better targeted, subsequent calculation becomes more efficient, and computing resources are saved.
In order to more clearly illustrate the structure of the sample set S, the following description is given by exemplifying the process of sample collection by the application server cluster.
Fig. 4 is a schematic diagram of sampling for an application server cluster according to an embodiment of the present application. As shown in Fig. 4, an application is generally deployed on multiple server nodes at the same time, and sampling may be performed for each server node. The sample information includes the sampling time, the value of each operation index, and the corresponding fault events; for example, the index list and the event list in the figure are associated by a sample ID. The "event marker" in the index list indicates whether the sample has a corresponding event: it is "Y" if an event exists and "N" otherwise. The information stored in the event list can be extended as needed, for example by manual entry or by connecting to an existing operation and maintenance system to obtain details of event analysis and resolution, which provides decision support for troubleshooting and resolving identified fault risks. The database server cluster is sampled in a similar manner.
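The relationship between the index list and the event list can be pictured with a small data-structure sketch. The figure itself is not reproduced here, so every field name below is a hypothetical illustration of "index list", "event list", "sample ID" and "event marker" rather than the layout actually shown in Fig. 4.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FaultEvent:
    sample_id: str
    description: str
    resolution: str = ""       # optional detail, e.g. from an O&M system


@dataclass
class IndexRecord:
    sample_id: str
    node: str                  # server node the sample was taken from
    sampled_at: datetime
    indicators: dict           # operation index name -> value
    event_marker: str = "N"    # "Y" once an event is linked to this sample


def link_event(index_list, event_list, event):
    """Store the event and flag the sample that shares its sample ID."""
    for rec in index_list:
        if rec.sample_id == event.sample_id:
            rec.event_marker = "Y"
            event_list.append(event)
            return rec
    raise KeyError(f"no sample with ID {event.sample_id}")


index_list = [
    IndexRecord("S001", "app-node-1", datetime(2022, 11, 30, 10, 0),
                {"cpu": 35.0, "memory": 52.0}),
    IndexRecord("S002", "app-node-2", datetime(2022, 11, 30, 10, 0),
                {"cpu": 97.0, "memory": 60.0}),
]
event_list = []
link_event(index_list, event_list, FaultEvent("S002", "CPU usage over threshold"))
print([(r.sample_id, r.event_marker) for r in index_list])
```

Keeping events in a separate list keyed by sample ID lets the event details grow (analysis notes, resolution steps) without changing the shape of the index records used for matching.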
Fig. 5 is a schematic structural diagram of a fault risk monitoring device according to an embodiment of the present application. As shown in fig. 5, the failure risk monitoring apparatus 50 includes: an acquisition module 501, a matching module 502 and a generation module 503.
The acquisition module 501 is configured to obtain a current operating state of the system; the current operating state comprises current values of a plurality of operating indicators;
a matching module 502, configured to match the current operating state with multiple historical operating states, respectively, to obtain matching degrees corresponding to the multiple historical operating states, respectively; different historical operating states correspond to different sampling times; the historical operating state comprises historical values of a plurality of operating indexes at corresponding sampling time;
the generating module 503 is configured to obtain a fault event in the historical operating state corresponding to the maximum value in the multiple matching degrees, and generate a fault risk prompt according to the fault event.
The fault risk monitoring equipment provided by the embodiment of the application acquires a plurality of current operation indexes, matches the plurality of operation indexes with the historical indexes acquired in advance, and generates a fault risk prompt based on a fault event occurring under the historical operation state with the maximum matching degree, so that the monitoring comprehensiveness and accuracy can be improved.
The fault risk monitoring device provided in the embodiment of the present application may be configured to execute the method embodiment described above, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic diagram of a hardware structure of a fault risk monitoring device according to an embodiment of the present application, where the device may be a terminal device or a server.
Device 60 may include one or more of the following components: a processing component 601, a memory 602, a power component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.
The processing component 601 generally controls overall operation of the device 60, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 601 may include one or more processors 609 to execute instructions to perform all or part of the steps of the methods described above. Further, processing component 601 may include one or more modules that facilitate interaction between processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.
The memory 602 is configured to store various types of data to support operations at the device 60. Examples of such data include instructions for any application or method operating on the device 60, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 602 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 603 provides power to the various components of the device 60. The power component 603 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 60.
The multimedia component 604 includes a screen providing an output interface between the device 60 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 604 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 60 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 605 is configured to output and/or input audio signals. For example, the audio component 605 includes a microphone (MIC) configured to receive external audio signals when the device 60 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 602 or transmitted via the communication component 608. In some embodiments, the audio component 605 also includes a speaker for outputting audio signals.
The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 607 includes one or more sensors for providing various aspects of status assessment for the device 60. For example, the sensor component 607 may detect the open/closed state of the device 60 and the relative positioning of components, such as a display and keypad of the device 60. The sensor component 607 may also detect a change in the position of the device 60 or a component of the device 60, the presence or absence of user contact with the device 60, the orientation or acceleration/deceleration of the device 60, and a change in the temperature of the device 60. The sensor component 607 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor component 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 608 is configured to facilitate wired or wireless communication between the device 60 and other devices. The device 60 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 60 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as the memory 602, is also provided; the instructions are executable by the processor 609 of the device 60 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
An embodiment of the present application further provides a computer program product, which includes a computer program; when the computer program is executed by a processor, the fault risk monitoring method performed by the above fault risk monitoring device is implemented.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A fault risk monitoring method, comprising:
acquiring the current running state of the system; the current operating state comprises current values of a plurality of operating indicators;
matching the current running state with a plurality of historical running states respectively to obtain matching degrees corresponding to the plurality of historical running states respectively; different historical operating states correspond to different sampling times; the historical operating state comprises historical values of a plurality of operating indexes at corresponding sampling time;
and acquiring a fault event under a historical operating state corresponding to the maximum value in the matching degrees, and generating a fault risk prompt according to the fault event.
2. The method of claim 1, wherein said matching said current operating state to a plurality of historical operating states, respectively, comprises:
and if the current values do not exceed the corresponding preset threshold value ranges, respectively matching the current running state with the historical running states.
3. The method of claim 1, wherein after obtaining the current operating state of the system, further comprising:
if at least one current value in the current values exceeds a corresponding preset threshold range, generating a corresponding fault early warning;
screening a plurality of historical operating states to obtain a plurality of operating states to be matched; the fault event corresponding to the running state to be matched comprises a fault event corresponding to the fault early warning;
the matching the current operating state with the plurality of historical operating states respectively to obtain matching degrees corresponding to the plurality of historical operating states respectively comprises:
and respectively matching the current running state with the running states to be matched to obtain matching degrees respectively corresponding to the running states to be matched.
4. The method according to any one of claims 1 to 3, wherein the matching the current operating state with a plurality of historical operating states respectively to obtain matching degrees corresponding to the plurality of historical operating states respectively comprises:
acquiring a first vector of the current running state and a plurality of second vectors of the historical running states;
and aiming at a second vector of each historical operating state in a plurality of historical operating states, calculating the Mahalanobis distance between the first vector and the second vector, and determining the matching degree corresponding to the historical operating state according to the Mahalanobis distance.
5. The method according to any of claims 1-3, wherein prior to matching the current operating state with a plurality of historical operating states, respectively, further comprising:
dividing a preset acquisition period into a plurality of time intervals;
and aiming at each time interval, acquiring historical operating states according to the sampling frequency corresponding to the time interval, and performing associated storage on the acquired historical operating states and fault events occurring at corresponding acquisition moments.
6. The method according to claim 5, wherein before the collecting of the historical operating state according to the sampling frequency corresponding to the time interval, the method further comprises:
determining the demand of the fault event according to the confidence coefficient demand and the sampling error demand;
collecting a plurality of fault events according to the demand;
dividing the plurality of fault events into a plurality of different time intervals;
acquiring the number of fault events occurring in each time interval;
and determining sampling frequencies corresponding to the time intervals according to the number of the fault events corresponding to the time intervals.
7. The method according to claim 6, wherein the determining sampling frequencies corresponding to the plurality of time intervals according to the number of fault events corresponding to the plurality of time intervals comprises:
determining the acquisition times corresponding to the preset acquisition period;
and calculating the ratio of the number of the fault events corresponding to each time interval to the total number of the plurality of fault events, and determining the sampling frequency corresponding to each time interval according to the ratio and the acquisition times.
8. The method according to claim 5, wherein before the collecting of the historical operating state according to the sampling frequency corresponding to the time interval for each time interval, the method further comprises:
determining a target duration according to the software iteration frequency;
the collecting of the historical operating state according to the sampling frequency corresponding to each time interval comprises the following steps:
aiming at each time interval in a plurality of preset acquisition periods corresponding to the target duration, acquiring historical operating states according to sampling frequency corresponding to the time interval, and adding the acquired historical operating states into a sample set;
the matching the current operating state with the plurality of historical operating states respectively includes:
and respectively matching the current running state with a plurality of historical running states in the sample set.
9. A fault risk monitoring device, comprising:
the acquisition module is used for acquiring the current running state of the system; the current operating state comprises current values of a plurality of operating indicators;
the matching module is used for respectively matching the current running state with a plurality of historical running states to obtain matching degrees respectively corresponding to the plurality of historical running states; different historical operating states correspond to different sampling times; the historical operating state comprises historical values of a plurality of operating indexes at corresponding sampling time;
and the generating module is used for acquiring a fault event in a historical operating state corresponding to the maximum value in the matching degrees and generating a fault risk prompt according to the fault event.
10. A fault risk monitoring device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing computer-executable instructions stored by the memory causes the at least one processor to perform the fault risk monitoring method of any of claims 1 to 8.
11. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the fault risk monitoring method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the fault risk monitoring method according to any one of claims 1 to 8.
CN202211520954.6A 2022-11-30 2022-11-30 Fault risk monitoring method, apparatus, storage medium and program product Pending CN115729783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211520954.6A CN115729783A (en) 2022-11-30 2022-11-30 Fault risk monitoring method, apparatus, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211520954.6A CN115729783A (en) 2022-11-30 2022-11-30 Fault risk monitoring method, apparatus, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115729783A true CN115729783A (en) 2023-03-03

Family

ID=85299515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211520954.6A Pending CN115729783A (en) 2022-11-30 2022-11-30 Fault risk monitoring method, apparatus, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115729783A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116117827A (en) * 2023-04-13 2023-05-16 北京奔驰汽车有限公司 Industrial robot state monitoring method and device
CN116660660A (en) * 2023-06-06 2023-08-29 南京志卓电子科技有限公司 Train power supply safety monitoring system and method
CN116660660B (en) * 2023-06-06 2023-10-20 南京志卓电子科技有限公司 Train power supply safety monitoring system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination