CN115514621B

CN115514621B - Fault monitoring method, electronic device and storage medium

Info

Publication number: CN115514621B
Application number: CN202211431652.1A
Authority: CN
Inventors: 史洋洋; 潘涌; 吕彪; 钮骏凯; 韩泽鋆; 杨帅; 芮藤长; 肖雄; 祝顺民
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2023-03-21
Anticipated expiration: 2042-11-15
Also published as: CN115514621A

Abstract

According to the embodiments of the application, a monitoring threshold matched to the stationarity characteristic of a preset parameter can be determined, and preset parameters with different stationarity characteristics are matched with different monitoring thresholds. Using the monitoring threshold matched to the stationarity characteristic to perform fault monitoring on the cloud network device and its preset parameters can effectively improve the accuracy of fault monitoring and reduce false alarms.

Description

Fault monitoring method, electronic device and storage medium

Technical Field

The present application relates to the field of cloud computing technologies, and in particular, to a fault monitoring method, an electronic device, and a storage medium.

Background

In large-scale, complex cloud network operation and maintenance scenarios, the volume of time series data generated per minute is on the order of billions, and monitoring and analyzing this data makes it possible to determine whether a cloud network device has failed.

At present, a large number of false alarms are generated when monitoring cloud network devices or the preset parameters they generate, so that genuinely faulty devices and fault data are drowned out.

Disclosure of Invention

The embodiment of the application provides a fault monitoring method, electronic equipment and a storage medium, so as to improve the accuracy of fault monitoring on cloud network equipment and each preset parameter generated by the cloud network equipment.

In a first aspect, an embodiment of the present application provides a fault monitoring method, including:

acquiring target time sequence data which is currently generated by cloud network equipment and corresponds to preset parameters, and a monitoring threshold value matched with the preset parameters; the monitoring threshold is determined according to the stationarity characteristics of time sequence data corresponding to preset parameters generated when the cloud network equipment operates normally;

determining the current stationarity characteristic of the preset parameter according to the target time sequence data;

and carrying out fault monitoring on the preset parameters of the cloud network equipment according to the current stationarity characteristic and the monitoring threshold.

In a possible implementation manner, determining the current stationarity characteristic of the preset parameter according to the target time series data includes:

determining an initial stationary characteristic of the target time series data;

passing the initial stationary features and the target timing data into a classification model;

and processing the initial stationarity feature and the target time sequence data by the classification model, and outputting the stationarity feature.

In a possible embodiment, the classification model is obtained by training, the training comprising:

respectively acquiring a plurality of sample time sequence data corresponding to each preset parameter in at least one preset parameter;

clustering the plurality of sample time sequence data to obtain at least one group of sample time sequence data;

aiming at each group of sample time sequence data in the at least one group of sample time sequence data, acquiring an expert label of at least one sample time sequence data in the group of sample time sequence data; determining a sample label of each sample time sequence data in the group of sample time sequence data according to the expert labeling label;

and training the classification model by utilizing each sample time sequence data and the sample label corresponding to each sample time sequence data.

In a possible implementation manner, the determining, according to the expert annotation tag, a sample tag of each sample time series data in the set of sample time series data includes:

and under the condition that each expert label in the group of sample time sequence data is the same, taking the expert label as the sample label of each sample time sequence data in the group of sample time sequence data.

In a possible implementation manner, the determining, according to the expert annotation tag, a sample tag of each sample time series data in the set of sample time series data further includes:

under the condition that the expert labeling labels in the group of sample time sequence data are different, returning to the step of clustering the plurality of sample time sequence data; or,

and under the condition that the expert labeling labels in the group of sample time sequence data are different, clustering the group of sample time sequence data to obtain at least one group of new sample time sequence data, and determining the sample label of each sample time sequence data in each group of new sample time sequence data.

In one possible embodiment, the initial stationary characteristic comprises at least one of:

and the mean, variance, covariance, maximum, minimum, skewness and kurtosis of the target time sequence data.

In a possible embodiment, the preset parameters include at least one of:

flow rate; packet loss rate; throughput of the database.

In a possible embodiment, the stationary value corresponding to the stationary characteristic is inversely related to the monitoring threshold.

In a possible embodiment, the performing fault monitoring on the preset parameter of the cloud network device according to the current stationarity feature and the monitoring threshold includes:

and under the condition that the stability value corresponding to the current stability characteristic exceeds the monitoring range corresponding to the monitoring threshold, determining that the preset parameter of the cloud network equipment fails, and generating fault monitoring information.

In a second aspect, an embodiment of the present application provides a fault monitoring apparatus, including:

a data acquisition module, configured to acquire target time sequence data which is currently generated by cloud network equipment and corresponds to preset parameters, and a monitoring threshold value matched with the preset parameters; the monitoring threshold is determined according to the stationarity characteristics of time sequence data corresponding to preset parameters generated when the cloud network equipment operates normally;

a data processing module, configured to determine the current stationarity characteristic of the preset parameter according to the target time sequence data;

and the monitoring module is used for carrying out fault monitoring on the preset parameters of the cloud network equipment according to the current stationarity characteristic and the monitoring threshold value.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory, where the processor implements the method described in any one of the foregoing when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method of any one of the above.

Compared with the prior art, the method has the following advantages:

According to the embodiments of the application, target time sequence data which is currently generated by cloud network equipment and corresponds to preset parameters, and a monitoring threshold value matched with the preset parameters, are first obtained; the monitoring threshold is determined according to the stationarity characteristics of time sequence data corresponding to the preset parameters generated when the cloud network equipment operates normally. Then, the current stationarity characteristic of the preset parameter is determined according to the target time sequence data. Finally, fault monitoring is carried out on the preset parameters of the cloud network equipment according to the current stationarity characteristic and the monitoring threshold. When the stationarity of a preset parameter is poor, fault monitoring is more accurate with a larger monitoring threshold; when the stationarity of a preset parameter is good, fault monitoring is more accurate with a smaller monitoring threshold. The monitoring threshold obtained in the application is matched to the stationarity characteristic of the preset parameter, and preset parameters with different stationarity characteristics are matched with different monitoring thresholds. Using the matched monitoring threshold to carry out fault monitoring on the preset parameters can therefore effectively improve the accuracy of fault monitoring and reduce false alarms.

The above description is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented in accordance with the content of the description, and so that the above and other objects, features, and advantages of the present application are more apparent, the detailed description of the present application is given below.

Drawings

In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are not to be considered limiting of its scope.

FIG. 1 is a schematic view of a scenario of a fault monitoring scheme provided herein;

FIG. 2 is a flow chart of a fault monitoring method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a fault monitoring method of another embodiment of the present application;

FIG. 4 is a block diagram of a fault monitoring device according to an embodiment of the present application;

FIG. 5 is a block diagram of an electronic device used to implement embodiments of the present application.

Detailed Description

In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

In order to facilitate understanding of the technical solutions of the embodiments of the present application, the following description is made of related art of the embodiments of the present application. The following related arts as alternatives can be arbitrarily combined with the technical solutions of the embodiments of the present application, and all of them belong to the scope of the embodiments of the present application.

Fig. 1 is a schematic diagram of an exemplary application scenario for implementing the method of the embodiment of the present application.

As shown in fig. 1, the cloud network fault monitoring platform 102 is in communication connection with a plurality of cloud network devices 101, and the cloud network fault monitoring platform 102 is configured to collect or receive time series data corresponding to each preset parameter generated by the cloud network devices 101, and process the time series data to implement fault monitoring on the cloud network devices and each preset parameter. As shown in fig. 1, the cloud network device 101 may be a computer, a server, a mobile phone, a cluster, and the like, which is not limited in this application.

Specifically, after acquiring the time series data currently generated by the cloud network device 101 for a certain preset parameter, the cloud network fault monitoring platform 102 acquires a monitoring threshold matched with the preset parameter. The monitoring threshold is determined according to the stationarity characteristic of the time series data corresponding to the preset parameter generated when the cloud network device operates normally: when the stationarity of the preset parameter is good, the monitoring threshold is set smaller; when the stationarity of the preset parameter is poor, the monitoring threshold is set larger. Then, the cloud network fault monitoring platform 102 determines the stationarity characteristic of the time series data currently generated by the cloud network device 101, and determines that the preset parameter of the cloud network device has a fault when the stationarity value corresponding to the stationarity characteristic of the current time series data falls outside the monitoring range corresponding to the monitoring threshold. The monitoring threshold determined in this way is matched to the stationarity characteristic of the preset parameter, and preset parameters with different stationarity characteristics are matched with different monitoring thresholds. Using the matched monitoring threshold to carry out fault monitoring on the preset parameters can therefore effectively improve the accuracy of fault monitoring and reduce false alarms.

An embodiment of the present application provides a fault monitoring method, and as shown in fig. 2, a flowchart of the fault monitoring method according to the embodiment of the present application may include:

S201, acquiring target time sequence data which is currently generated by cloud network equipment and corresponds to preset parameters, and a monitoring threshold value matched with the preset parameters; the monitoring threshold is determined according to the stationarity characteristics of time sequence data corresponding to preset parameters generated when the cloud network equipment operates normally.

S202, determining the current stationarity characteristic of the preset parameter according to the target time sequence data.

S203, carrying out fault monitoring on the preset parameters of the cloud network equipment according to the current stationarity characteristics and the monitoring threshold.

Illustratively, the preset parameters include index data of the cloud network device, such as traffic, packet loss rate, throughput of the database, and the like.

The target time sequence data may be acquired by the cloud network fault monitoring platform from the cloud network device, or may be transmitted to the cloud network fault monitoring platform after the cloud network device acquires the target time sequence data of the cloud network device, which is not limited in the present application.

For the manner of actively acquiring target time sequence data from the cloud network device by the cloud network fault monitoring platform, the following steps may be specifically executed: before the cloud network fault monitoring platform collects the target time sequence data, a collection time length and a collection frequency can be preset, and then the target time sequence data generated by the cloud network equipment is collected according to the collection time length and the collection frequency.

The cloud network equipment can also acquire data of the cloud network equipment according to the acquisition duration and the acquisition frequency, and the target time sequence data is formed and then sent to the cloud network fault monitoring platform.

In summary, the target time series data includes a series of data collected within the collection duration, and each data is arranged according to the sequence of the collection time.

Illustratively, the stationarity characteristic of the target time series data may be determined by using the augmented Dickey-Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, or by using both methods together; for example, the stationarity characteristic determined by the ADF test and the stationarity characteristic determined by the KPSS test may be weighted and summed to obtain the final stationarity characteristic of the target time series data. The ADF test is an augmented form of the Dickey-Fuller unit-root test, and the KPSS test checks stationarity around a level or a deterministic trend. Illustratively, the stationarity characteristic can also be determined by using a trained classification model.
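The weighted summation of the two test results can be sketched as follows. This is a minimal sketch only: the text does not state how each test result is converted into a score or which weights are used, so the function name `combine_stationarity`, the assumption that each test has already been mapped to a score between 0 and 1, and the default equal weighting are all hypothetical.

```python
def combine_stationarity(adf_score: float, kpss_score: float,
                         adf_weight: float = 0.5) -> float:
    """Weighted sum of two per-test stationarity scores.

    Assumption: each score was already mapped to [0, 1], where a larger
    value means "more stationary". The mapping itself is not specified
    in the text and is outside this sketch.
    """
    if not 0.0 <= adf_weight <= 1.0:
        raise ValueError("adf_weight must lie in [0, 1]")
    for score in (adf_score, kpss_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    # The KPSS weight is the complement, so the result stays in [0, 1].
    return adf_weight * adf_score + (1.0 - adf_weight) * kpss_score
```

Keeping the two weights complementary means the combined value remains on the same 0 to 1 scale as the stationarity value used later for threshold calculation.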

Illustratively, the stationarity feature may include stationary timing data, non-stationary timing data, and the like.

Illustratively, different monitoring thresholds are set in advance for different stationarity characteristics, forming a mapping relation. For example, when the stationarity is poor, the monitoring threshold may be set larger; when the stationarity is good, the monitoring threshold may be set smaller. A stationarity value can be used to represent the stationarity characteristic of the target time series data, and it may be a value between 0 and 1. The better the stationarity of the target time series data, the larger its stationarity value and the smaller the monitoring threshold that is set; the worse the stationarity of the target time series data, the smaller its stationarity value and the larger the monitoring threshold that is set. In other words, the stationarity value is associated with the monitoring threshold, and the two may be negatively correlated.

After the mapping relation between the monitoring threshold and the stationarity feature is formed, the mapping relation is stored, when the monitoring threshold matched with a certain preset parameter is determined, the mapping relation can be firstly obtained, and then the stationarity feature of the time sequence data corresponding to the preset parameter when the cloud network equipment operates normally is determined; and finally, determining a monitoring threshold value matched with the stationarity characteristics of the preset parameters by inquiring the mapping relation.
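The stored mapping relation and its lookup can be sketched as a simple table. The table contents are hypothetical: the labels "stationary" and "non-stationary" follow the stationarity characteristics named above, while the threshold values 0.1 and 0.2 are placeholders chosen only to respect the rule that poorer stationarity gets a larger threshold.

```python
from typing import Mapping

# Hypothetical mapping relation: stationarity characteristic -> monitoring threshold.
THRESHOLD_BY_FEATURE = {
    "stationary": 0.1,      # good stationarity, smaller threshold
    "non-stationary": 0.2,  # poor stationarity, larger threshold
}


def lookup_threshold(feature: str,
                     mapping: Mapping[str, float] = THRESHOLD_BY_FEATURE) -> float:
    """Return the monitoring threshold matched to a stationarity characteristic."""
    if feature not in mapping:
        raise KeyError(f"no monitoring threshold configured for {feature!r}")
    return mapping[feature]
```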

Illustratively, the following steps can also be utilized to determine the monitoring threshold value matched with the stationarity characteristic of the preset parameter:

firstly, acquiring an incidence relation between stationarity characteristics and a monitoring threshold, wherein the incidence relation can be a calculation formula for calculating the monitoring threshold according to a stationarity value corresponding to the stationarity characteristics; and then, calculating to obtain a monitoring threshold value matched with the stationarity characteristics of the preset parameters by using the calculation formula and the stationarity value of the time sequence data corresponding to the preset parameters generated by the cloud network equipment in the normal state.

The correlation between the stationarity feature and the monitoring threshold may be predetermined according to the following steps:

First, correlation parameters Q and B between the stationarity value corresponding to the stationarity characteristic and the monitoring threshold are determined. Then, a calculation formula is determined by using the correlation parameters Q and B; for example, the obtained calculation formula is:

monitoring threshold = B - Q × stationarity value

The stationarity value corresponding to the stationarity characteristic is the variable in the calculation formula, and the calculation formula together with its parameters represents the association relation.

According to the above formula, when the stationarity value is small, the calculated monitoring threshold is large, and when the stationarity value is large, the calculated monitoring threshold is small. After the monitoring threshold is obtained, the monitoring range corresponding to it is determined; for example, the following range may be used as the monitoring range: from (stationarity value - monitoring threshold) to (stationarity value + monitoring threshold).

Illustratively, Q may be 0.5 and B may be 0.55. If the stationarity value is 0.7, the monitoring threshold calculated by the above formula is 0.2, and the monitoring range is 0.5 to 0.9; if the stationarity value is 0.9, the monitoring threshold calculated by the formula is 0.1, and the monitoring range is 0.8 to 1.
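The two worked examples can be reproduced with a short sketch. It assumes the linear form threshold = B - Q × stationarity value, which is the form consistent with both examples above; the function names are hypothetical.

```python
import math
from typing import Tuple


def monitoring_threshold(stationarity_value: float,
                         q: float = 0.5, b: float = 0.55) -> float:
    """Assumed linear relation: a larger stationarity value gives a smaller threshold."""
    return b - q * stationarity_value


def monitoring_range(stationarity_value: float,
                     threshold: float) -> Tuple[float, float]:
    """Range centred on the stationarity value, one threshold wide on each side."""
    return (stationarity_value - threshold, stationarity_value + threshold)


if __name__ == "__main__":
    for value in (0.7, 0.9):
        thr = monitoring_threshold(value)
        low, high = monitoring_range(value, thr)
        # Rounded for display only; floating-point results are approximate.
        print(f"value={value} threshold={thr:.2f} range=({low:.2f}, {high:.2f})")
    assert math.isclose(monitoring_threshold(0.7), 0.2)
```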

For example, for time series data with good stationarity, such as the packet loss rate, the stationarity value may be 0.9, and the corresponding fluctuation range, that is, the monitoring range, should be small, namely 0.8 to 1. When the monitoring range is exceeded, the packet loss rate time series data and the corresponding cloud network device are considered to have a fault.

For example, for time series data with poor stationarity, such as the traffic, the stationarity value may be 0.7, and the corresponding fluctuation range, that is, the monitoring range, should be large, namely 0.5 to 0.9. When the monitoring range is exceeded, the traffic time series data and the corresponding cloud network device are considered to have a fault; when the monitoring range is not exceeded, they are considered to have no fault.
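The fault decision in these two examples can be sketched as a range check. The function name `is_faulty` is hypothetical, and treating a value that lies exactly on the boundary as normal is an assumption, since the text only says the range must be "exceeded".

```python
def is_faulty(current_value: float, baseline_value: float, threshold: float) -> bool:
    """Return True when the current stationarity value leaves the monitoring range.

    baseline_value is the stationarity value observed in normal operation;
    the monitoring range is baseline_value +/- threshold.
    """
    lower = baseline_value - threshold
    upper = baseline_value + threshold
    return not (lower <= current_value <= upper)


if __name__ == "__main__":
    current = 0.75
    # Packet loss rate: baseline 0.9, threshold 0.1, range 0.8 to 1.
    print("packet loss faulty:", is_faulty(current, 0.9, 0.1))
    # Traffic: baseline 0.7, threshold 0.2, range 0.5 to 0.9.
    print("traffic faulty:", is_faulty(current, 0.7, 0.2))
```

The same current value of 0.75 is outside the narrow packet loss range but inside the wider traffic range, which is the effect the parameter-specific threshold is meant to achieve.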

For example, after the fault monitoring information is generated, the fault monitoring information may be sent to the cloud network device, so that the cloud network device displays the fault information. Illustratively, the fault type corresponding to the fault monitoring information may include equipment fault, change abnormality, and the like; wherein the change exception may include a software change exception, a hardware change exception, and the like. At least one of an identifier of the preset parameter, a name of the preset parameter, an identifier of the cloud network device, a name of the cloud network device, a fault type, and the like may be included in the fault monitoring information.
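One possible shape for the fault monitoring information is sketched below. The class and field names are hypothetical; the text only lists which items may be included, and since it requires at least one of them, every field is optional here.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class FaultMonitoringInfo:
    """Hypothetical container for the items the text says may be reported."""
    parameter_id: Optional[str] = None
    parameter_name: Optional[str] = None
    device_id: Optional[str] = None
    device_name: Optional[str] = None
    fault_type: Optional[str] = None  # e.g. "equipment fault", "software change exception"

    def to_payload(self) -> Dict[str, str]:
        """Drop unset fields so only the provided items are sent."""
        return {key: value for key, value in asdict(self).items() if value is not None}
```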

The embodiment can automatically output the monitoring threshold value, provides an intelligent solution for the monitoring scene of the cloud network equipment, greatly reduces the number of false alarm alarms, reduces the burden of operation and maintenance personnel, and improves the stability of the system. In addition, the operation and maintenance personnel can refer to the monitoring threshold value and the actual scene, adjust the monitoring threshold value to further improve the accuracy of fault monitoring, reduce the number of false alarm alarms and continuously improve the operation and maintenance efficiency.

In some embodiments, the stationarity characteristic of the target time series data corresponding to the preset parameter may be determined by:

First, the initial stationarity characteristic of the target time series data corresponding to the preset parameter is determined. Illustratively, the augmented Dickey-Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test may be used to determine the initial stationarity characteristic of the target time series data, or both methods may be used together; for example, the initial stationarity characteristic determined by the ADF test and the initial stationarity characteristic determined by the KPSS test may be combined, and the combined result is used as the final initial stationarity characteristic of the target time series data. Illustratively, the initial stationarity characteristic includes at least one of: the mean, variance, covariance, maximum value, minimum value, skewness and kurtosis of the target time series data. Here, skewness characterizes the direction and degree of skew (the degree of asymmetry) of a statistical data distribution. Kurtosis, also called the kurtosis coefficient, characterizes the height of the peak of a probability density curve at the mean; intuitively, it reflects the sharpness of the peak.
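The listed statistics can be computed with the standard library alone, as in the sketch below. Several choices are assumptions not fixed by the text: population (divide by n) moments are used, kurtosis is reported without subtracting 3, and "covariance" is interpreted as the lag-1 autocovariance of the series.

```python
from typing import Dict, Sequence


def initial_stationarity_features(series: Sequence[float]) -> Dict[str, float]:
    """Descriptive statistics of one time series (population moments)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(d * d for d in centered) / n
    # Assumption: "covariance" is taken as the lag-1 autocovariance.
    autocov = sum(centered[i] * centered[i + 1] for i in range(n - 1)) / n
    if variance == 0.0:
        skewness = kurtosis = 0.0  # constant series: moments are undefined, report 0
    else:
        skewness = (sum(d ** 3 for d in centered) / n) / variance ** 1.5
        kurtosis = (sum(d ** 4 for d in centered) / n) / variance ** 2
    return {
        "mean": mean,
        "variance": variance,
        "covariance": autocov,
        "max": max(series),
        "min": min(series),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }
```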

And then, transmitting the initial stationary characteristic and the target time sequence data into a classification model, processing the initial stationary characteristic and the target time sequence data by the classification model, and outputting the stationary characteristic.

The classification model is obtained by utilizing a large number of training samples to train in advance, and the precision of the classification model meets preset conditions, for example, the accuracy rate is greater than 60%, and the recall rate is greater than 70%. Meanwhile, the classification model can realize the small-scale classification of million-scale time series data.
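Checking the preset precision conditions can be sketched as follows. The function names are hypothetical, the choice of which class counts as positive for recall is an assumption, and "accuracy rate" is read here as plain accuracy (the share of correct predictions), although the translated term could also mean precision.

```python
from typing import Hashable, Sequence, Tuple


def accuracy_and_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                        positive: Hashable) -> Tuple[float, float]:
    """Accuracy over all samples and recall for the chosen positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    false_neg = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


def meets_preset_conditions(accuracy: float, recall: float,
                            min_accuracy: float = 0.60,
                            min_recall: float = 0.70) -> bool:
    """Both metrics must be strictly greater than their preset minimums."""
    return accuracy > min_accuracy and recall > min_recall
```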

Illustratively, the classification model may include a multi-layer Long Short-Term Memory artificial neural network (LSTM).

In the field of artificial intelligence, in a real scene, the data size is huge, the data does not have labels, and if the data is used as sample data of a training model, the sample data needs to be labeled manually to generate a sample label of the sample data. Because the sample data requirement of the training model is large, enormous manpower and time cost are required to label the sample label.

In view of the above-mentioned drawbacks, the present application provides a method for generating a sample tag of sample data, where the sample data is the following sample timing sequence data, and the method can be implemented by using the following steps:

firstly, acquiring a plurality of sample time sequence data corresponding to at least one preset parameter; clustering the plurality of sample time sequence data to obtain at least one group of sample time sequence data; finally, aiming at each group of sample time sequence data in the at least one group of sample time sequence data, acquiring an expert label of at least one sample time sequence data in the group of sample time sequence data; and determining the sample label of each sample time sequence data in the group of sample time sequence data according to the expert labeling label.

Illustratively, the cloud network fault monitoring platform may actively collect sample time series data from the cloud network device, which may specifically be implemented by the following steps: before collecting the sample time series data, the cloud network fault monitoring platform presets a collection duration and a collection frequency, and then collects the sample time series data of the cloud network device according to the collection duration and the collection frequency. For example, the collection duration may be set to 14 days and the collection frequency to once per minute; that is, the cloud network fault monitoring platform collects data from the cloud network device once per minute for 14 consecutive days, and the resulting series of data forms the sample time series data.
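The size of one sample time series under these settings follows directly from the duration and the sampling interval, as the short sketch below shows; the function name is hypothetical.

```python
from datetime import timedelta


def expected_points(duration: timedelta, interval: timedelta) -> int:
    """Number of whole sampling intervals that fit into the collection duration."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    return duration // interval


if __name__ == "__main__":
    # 14 days at one sample per minute: 14 * 24 * 60 points.
    print(expected_points(timedelta(days=14), timedelta(minutes=1)))
```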

The cloud network equipment can also collect data of the cloud network equipment according to the collection duration and the collection frequency, and the data are sent to the cloud network fault monitoring platform after sample time sequence data are formed.

Illustratively, the multiple sample time series data may be clustered by using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) method. Specifically, the method clusters sample time series data numbering on the order of hundreds of millions into thousands of categories according to shape, density and the like, where each category corresponds to one group of sample time series data.
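The clustering idea can be illustrated with a minimal DBSCAN written in plain Python. This is a teaching sketch, not the implementation used at the scale described: it compares every pair of points, so it is only practical for small inputs, and it assumes each time series has already been reduced to a fixed-length feature vector (the text does not say how series are represented for clustering). The parameter names `eps` and `min_samples` follow common DBSCAN usage.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return a cluster id per point; NOISE (-1) marks points in no cluster."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster += 1
    return [label if label is not None else NOISE for label in labels]
```

Each resulting cluster id corresponds to one group of sample time series data; points labeled -1 belong to no group.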

Each sample time sequence data in a group of sample time sequence data obtained by clustering has similar data characteristics, at least one sample time sequence data can be manually labeled to obtain an expert label of the sample time sequence data, and then sample labels of other unlabeled sample time sequence data in the group are determined according to the obtained expert label. Therefore, the method avoids manual marking of all sample time sequence data, saves manpower resources, and can improve marking speed, thereby being beneficial to improving model training efficiency.

Illustratively, the expert labels obtained by manual labeling are highly accurate, so they can be used as reference information to determine the sample label of each sample time series data in the group, or used directly as the sample label of every sample time series data in the corresponding group.

Specifically, if only one sample time series data in a group is manually labeled, one expert label is obtained, and it can be used directly as the sample label of each sample time series data in that group.

If several sample time series data in the group are manually labeled, several expert labels are obtained and then compared. If the expert labels are all the same, the clustering result is considered accurate and every sample time series data in the group belongs to the same class, so the expert label can be used as the sample label of each sample time series data in the group. If the expert labels differ, the clustering result is not accurate enough, and either this group of data or all of the sample time series data needs to be clustered again.

When only the group is clustered again, several new groups of sample time series data are obtained, and the sample label of each sample time series data in each new group is determined by the same label-determination steps described above, which are not repeated here.

When all of the sample time series data are clustered again, the process returns to the step of clustering the plurality of sample time series data to obtain several new groups, and the operation of determining the sample label of each sample time series data in each group from the manually labeled expert labels is repeated.

After the sample label of each sample time series data is obtained, the classification model can be trained by using each sample time series data and its corresponding sample label.
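The label-determination rule can be sketched as follows. The function name and data layout are hypothetical: `cluster_ids` holds one group id per sample, and `expert_labels` maps the indices of the manually labeled samples to their expert labels. Groups with conflicting expert labels are returned separately so that the caller can cluster them again.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Set, Tuple


def propagate_labels(cluster_ids: Sequence[int],
                     expert_labels: Dict[int, Hashable]
                     ) -> Tuple[Dict[int, Hashable], Set[int]]:
    """Spread expert labels to every sample in groups whose expert labels agree."""
    labels_in_group: Dict[int, Set[Hashable]] = defaultdict(set)
    for index, label in expert_labels.items():
        labels_in_group[cluster_ids[index]].add(label)

    sample_labels: Dict[int, Hashable] = {}
    needs_reclustering: Set[int] = set()
    for index, group in enumerate(cluster_ids):
        found = labels_in_group.get(group)
        if not found:
            continue  # no expert label in this group yet
        if len(found) == 1:
            sample_labels[index] = next(iter(found))
        else:
            needs_reclustering.add(group)  # conflicting labels: cluster again
    return sample_labels, needs_reclustering
```

For example, with groups `[0, 0, 0, 1, 1, 2]` and expert labels on samples 0, 3 and 4, group 0 inherits its single expert label, group 1 is flagged when its two expert labels differ, and group 2 stays unlabeled until an expert label is supplied.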

In addition, training the classification model can also draw on the initial stationarity characteristic of each sample time series data. Before the classification model is trained, the initial stationarity characteristic of each sample time series data can therefore be determined by using the augmented Dickey-Fuller (ADF) test or the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, or by using the two methods together; for example, the initial stationarity characteristic determined by the ADF test and the initial stationarity characteristic determined by the KPSS test can be combined, and the combined result is used as the final initial stationarity characteristic of the sample time series data.

Specifically, when the classification model is trained, each sample time series data and its initial stationarity features are input into the untrained classification model, which processes the input data and outputs a classification result for each sample time series data. The training loss is then determined from the classification result and the sample label of each sample time series data, and the parameters of the untrained classification model are adjusted according to the training loss. After adjustment, iterative training continues in the same manner until the number of training iterations exceeds a preset number or the accuracy of the classification model exceeds a preset accuracy, at which point the trained classification model is obtained.
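The two stopping conditions can be illustrated with a small, model-agnostic loop. This is only a sketch: `train_step` and `evaluate` are hypothetical callables standing in for one parameter update and one accuracy measurement, and the preset number is treated as an upper bound on iterations, which is an assumption about the boundary case.

```python
def train_until(train_step, evaluate, max_iterations, target_accuracy):
    """Repeat train_step() until either stopping condition is met.

    Stops when the iteration budget is used up or when the measured
    accuracy exceeds target_accuracy. Returns (iterations, accuracy).
    """
    iterations = 0
    accuracy = evaluate()
    while iterations < max_iterations and accuracy <= target_accuracy:
        train_step()          # one loss computation and parameter adjustment
        iterations += 1
        accuracy = evaluate()
    return iterations, accuracy


# Toy stand-in for a model: accuracy (in percent) rises by 10 per step.
state = {"accuracy": 50}


def toy_step():
    state["accuracy"] += 10


def toy_eval():
    return state["accuracy"]


print(train_until(toy_step, toy_eval, max_iterations=100, target_accuracy=85))
```

In the toy run the accuracy condition ends training after four steps; lowering `max_iterations` below four would make the iteration budget the condition that ends it.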

This embodiment obtains a large number of labeled training samples (sample data plus sample labels) from real scenarios at low labor and time cost, and then trains a classification model on these labeled samples in a semi-supervised manner. Besides being used for training, the labeled sample data are also used to evaluate the classification results of the classification model.

After the classification model is obtained, it is used to determine the stationarity characteristics of the preset parameters in the normal state and to determine the monitoring threshold matched with those stationarity characteristics. The monitoring threshold can assist operation and maintenance personnel in configuring fault monitoring, or can be used directly to perform fault monitoring on the cloud network equipment and the preset parameters.

In the related art, sample labels are basically produced in two ways: expert labeling (that is, manual labeling) and crowdsourced labeling. Expert labeling is accurate but expensive, whereas crowdsourced labeling is cheap but less accurate. By combining clustering with manual labeling, the present application rapidly generates a large number of sample labels with high accuracy and low cost, achieving rapid sample labeling in real scenarios. In addition, the present application automatically generates or matches monitoring thresholds based on the classification results of the classification model, that is, the stationarity features. In practical applications, the stationarity features and the corresponding monitoring thresholds can be presented when operation and maintenance personnel configure alarm items, providing intelligent fault monitoring configuration for cloud network operation and maintenance scenarios. Compared with manually configured monitoring thresholds, this costs less, produces fewer false alarms, and greatly improves cloud network operation and maintenance efficiency.

The fault monitoring method of the present application is further described below with an embodiment.

As shown in fig. 3, the fault monitoring method of this embodiment may include three stages: data labeling, time series classification, and monitoring configuration. The data labeling stage generates a large amount of sample time series data with sample labels. The time series classification stage determines the stationarity classification result of a preset parameter, that is, the stationarity characteristic of the preset parameter. The monitoring configuration stage determines a monitoring threshold matched with the stationarity characteristic and configures the fault alarm items of the cloud network device and the preset parameter according to the monitoring threshold. The cloud network device and the preset parameter may then be fault-monitored using the monitoring threshold.

Data labeling stage: the cloud network fault monitoring platform collects sample time series data that are generated by a plurality of cloud network devices and correspond to a plurality of preset parameters, according to a preset collection duration and collection frequency. For example, the collection duration may be set to 14 days and the collection frequency to once per minute; that is, the platform collects data from the cloud network devices once per minute for 14 consecutive days, and the resulting series of data forms the sample time series data.
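As a quick check of the data volume implied by this example, the number of points in one sample time series follows directly from the duration and the sampling interval. The helper name `samples_per_series` is hypothetical.

```python
from datetime import timedelta


def samples_per_series(duration, interval):
    """Number of samples collected over `duration` at one sample per `interval`."""
    return duration // interval


points = samples_per_series(timedelta(days=14), timedelta(minutes=1))
print(points)  # 20160 = 14 days * 24 hours * 60 minutes
```

Each device and preset parameter pair therefore contributes one series of 20,160 points under the example settings.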

The plurality of sample time series data are clustered using DBSCAN to obtain at least one group of sample time series data, that is, at least one category; in practical applications, clustering may yield thousands of categories. At least one sample time series data in each category is manually labeled, and the sample label of every sample time series data in that category is then determined from the manually labeled expert label. In this way, a large amount of representative sample time series data with sample labels is quickly obtained from real scenarios.
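For readers unfamiliar with DBSCAN, the following pure-Python sketch shows the density-based grouping idea on small feature vectors. The text names DBSCAN but does not specify the distance measure, the representation of each series, or the parameters, so the Euclidean distance, the two-dimensional toy points, and the values of `eps` and `min_pts` are all assumptions; a production system would more likely rely on an existing library implementation.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one cluster id per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # All points (including i itself) within eps of point i.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Toy feature vectors, e.g. (mean, spread) of each series; values are made up.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(features, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, DBSCAN does not need the number of categories in advance, which fits a setting where clustering may produce thousands of categories.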

Time series classification stage: the initial stationarity features of the sample time series data are determined using ADF and KPSS respectively. ADF and KPSS measure the stationarity of data from a statistical perspective and classify the sample time series data by stationarity, yielding the initial stationarity features. The initial stationarity features obtained by the two methods are then fused to obtain the final initial stationarity features.

A six-layer LSTM (that is, the classification model) is used to fit the sample time series data, with the sample time series data and the initial stationarity features as inputs during model training. The training loss is then determined from the stationarity classification result output by the LSTM (that is, the stationarity feature of the sample time series data) and the sample label of each sample time series data, and the parameters of the untrained classification model are adjusted according to the training loss. After adjustment, iterative training continues in the same manner until the number of training iterations exceeds a preset number or the accuracy of the classification model exceeds a preset accuracy, at which point the trained classification model is obtained.

The time series data corresponding to a given preset parameter, together with the initial stationarity feature of that time series data, are fed into the classification model. The classification model processes the received initial stationarity feature and time series data and outputs the classification result of the time series data, that is, the stationarity characteristic of the preset parameter.

The operations of the data labeling stage and the model-training step of the time series classification stage are completed in advance, so that the classification model can classify various time series data.

Monitoring configuration stage: a monitoring threshold matched with the stationarity characteristic of the preset parameter is determined. For example, a relatively stable preset parameter is assigned or matched with a smaller monitoring threshold, such as 0.1, while a relatively unstable preset parameter is assigned or matched with a larger monitoring threshold, such as 0.5, so that normal fluctuations of the preset parameter are not mistaken for a fault because the monitoring threshold is set too small.
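The matching step can be pictured as a lookup from stationarity class to threshold. The sketch below reuses the example values 0.1 and 0.5 from the text; the class names "stable" and "unstable" and the function name `monitoring_threshold` are hypothetical, and a real deployment could use more classes or a continuous mapping.

```python
# Example thresholds from the text; the class names are assumptions.
DEFAULT_THRESHOLDS = {"stable": 0.1, "unstable": 0.5}


def monitoring_threshold(stationarity_class, thresholds=DEFAULT_THRESHOLDS):
    """Return the monitoring threshold matched with a stationarity class."""
    try:
        return thresholds[stationarity_class]
    except KeyError:
        raise ValueError(
            f"no monitoring threshold configured for {stationarity_class!r}"
        ) from None


print(monitoring_threshold("stable"))    # 0.1
print(monitoring_threshold("unstable"))  # 0.5
```

Raising an error for an unknown class, instead of silently falling back to a default, keeps a misclassified parameter from being monitored with an unintended threshold.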

In addition, the stationarity of a preset parameter may change over time. With the scheme of the present application, a dynamic monitoring threshold matched with the preset parameter can be set, which makes setting or updating the monitoring threshold more flexible, further reduces the number of false alarms, relieves the pressure on operation and maintenance personnel, and helps keep the devices on the cloud network stable.

After the monitoring threshold is configured, the fault alarm item is configured according to the monitoring threshold.

After the monitoring threshold of the preset parameter is configured, the monitoring threshold may be used to perform fault monitoring on the preset parameter. This embodiment clusters and expert-labels the time series data corresponding to each preset parameter and classifies the time series data with a classification model, so a large number of sample labels can be generated quickly. This gives a cloud network fault monitoring platform or a cloud network intelligent operation and maintenance platform (AIOps) an automated, high-performance stationarity classification capability. The matched monitoring threshold can then be determined from the stationarity characteristic of the preset parameter, which gives operation and maintenance personnel an intelligent way to configure fault monitoring items, greatly reduces the number of false alarms, lowers operation and maintenance labor and time costs, and allows AIOps to be put into practice in cloud network operation and maintenance scenarios.
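The claims define the monitoring range as running from (stationarity value minus monitoring threshold) to (stationarity value plus monitoring threshold), with a fault reported when the current value falls outside that range. The check can be sketched as below. Two points are assumptions: the range is centered on a reference value taken from normal operation, and a value exactly on the boundary counts as normal. The function names are hypothetical.

```python
def monitoring_range(reference_value, threshold):
    """Range [reference - threshold, reference + threshold]."""
    return reference_value - threshold, reference_value + threshold


def is_fault(current_value, reference_value, threshold):
    """True when the current stationarity value lies outside the monitoring range."""
    low, high = monitoring_range(reference_value, threshold)
    return not (low <= current_value <= high)


# The same deviation is tolerated with a wide threshold and flagged with a narrow one.
print(is_fault(1.3, reference_value=1.0, threshold=0.5))  # False
print(is_fault(1.3, reference_value=1.0, threshold=0.1))  # True
```

The two calls show why the threshold has to follow the stationarity of the parameter: a deviation of 0.3 is normal fluctuation for a parameter matched with 0.5 but a fault for one matched with 0.1.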

Corresponding to the application scenario and the method provided by the embodiments of the present application, an embodiment of the present application further provides a fault monitoring apparatus. Fig. 4 is a block diagram of a fault monitoring apparatus according to an embodiment of the present application. The fault monitoring apparatus may include:

a data acquisition module 410, configured to acquire target time series data that is currently generated by the cloud network device and corresponds to a preset parameter, and a monitoring threshold matched with the preset parameter, where the monitoring threshold is determined according to the stationarity characteristic of time series data corresponding to the preset parameter generated when the cloud network device operates normally;

a data processing module 420, configured to determine a current stationarity characteristic of the preset parameter according to the target time series data; and

a monitoring module 430, configured to perform fault monitoring on the preset parameter of the cloud network device according to the current stationarity characteristic and the monitoring threshold.

In some embodiments, the data acquisition module 410 is further configured to determine, according to the target time series data, a current stationarity characteristic of the preset parameter:

In some embodiments, the data acquisition module 410 is further configured to train a classification model:

In some embodiments, the data acquisition module 410, when determining the sample label of each sample time series data in the group of sample time series data according to the expert label, is configured to:

In some embodiments, the data acquisition module 410, when determining the sample label of each sample time series data in the group of sample time series data according to the expert label, is further configured to:

In some embodiments, the initial stationarity feature comprises at least one of:

the mean, variance, covariance, maximum value, minimum value, skewness, and kurtosis of the target time series data.
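Most of the statistics listed above can be computed with the Python standard library, as the following sketch shows. It uses population (not sample) moments and non-excess kurtosis, which are choices made for this example; covariance is left out because the text does not say which second series or lag it is taken against. The function name `initial_features` is hypothetical.

```python
from statistics import fmean, pvariance


def initial_features(series):
    """Summary statistics of one time series (population moments)."""
    n = len(series)
    mean = fmean(series)
    variance = pvariance(series, mu=mean)
    # Third and fourth central moments for skewness and kurtosis.
    m3 = sum((x - mean) ** 3 for x in series) / n
    m4 = sum((x - mean) ** 4 for x in series) / n
    return {
        "mean": mean,
        "variance": variance,
        "max": max(series),
        "min": min(series),
        # A constant series has zero variance; report 0.0 instead of dividing by zero.
        "skewness": m3 / variance ** 1.5 if variance > 0 else 0.0,
        "kurtosis": m4 / variance ** 2 if variance > 0 else 0.0,
    }


features = initial_features([1, 2, 3, 4, 5])
print(features["mean"], features["variance"], features["skewness"])  # 3.0 2.0 0.0
```

For a symmetric series such as the one in the usage line the skewness is zero; a series with occasional large spikes would show positive skewness and a higher kurtosis.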

In some embodiments, the preset parameters include at least one of:

flow rate; packet loss rate; throughput of the database.

In some embodiments, the stationarity value corresponding to the stationarity characteristic is inversely proportional to the monitoring threshold.

The functions of each module in each device in the embodiment of the present application can be referred to the corresponding description in the above method, and have corresponding beneficial effects, which are not described herein again.

FIG. 5 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 5, the electronic device includes a memory 510 and a processor 520, the memory 510 storing a computer program executable on the processor 520. The processor 520, when executing the computer program, implements the method in the above embodiments. There may be one or more memories 510 and one or more processors 520.

The electronic device further includes:

the communication interface 530 is used for communicating with an external device to perform data interactive transmission.

If the memory 510, the processor 520, and the communication interface 530 are implemented independently, the memory 510, the processor 520, and the communication interface 530 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.

Optionally, in an implementation, if the memory 510, the processor 520, and the communication interface 530 are integrated on a chip, the memory 510, the processor 520, and the communication interface 530 may complete communication with each other through an internal interface.

Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.

The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and run an instruction stored in a memory from the memory, so that a communication device in which the chip is installed executes the method provided in the embodiment of the present application.

An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an Advanced reduced instruction set machine (ARM) architecture.

Further, optionally, the memory may include a read-only memory and a random access memory. The memory may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may include Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM may be used, for example Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.

Any process or method described in a flowchart or otherwise herein may be understood as representing a module, segment, or portion of code, which includes one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The logic and/or steps described in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The above description is only an exemplary embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope described in the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A fault monitoring method, comprising:

performing fault monitoring on the preset parameter of the cloud network device according to the current stationarity characteristic and the monitoring threshold;

wherein a stationarity value corresponding to the stationarity characteristic is inversely related to the monitoring threshold; the monitoring range corresponding to the monitoring threshold comprises: from (stationarity value - monitoring threshold) to (stationarity value + monitoring threshold);

the performing fault monitoring on the preset parameter of the cloud network device according to the current stationarity characteristic and the monitoring threshold includes:

determining that the preset parameter of the cloud network device has failed and generating fault monitoring information when the stationarity value corresponding to the current stationarity characteristic falls outside the monitoring range corresponding to the monitoring threshold.

2. The method according to claim 1, wherein the determining the current stationarity characteristic of the preset parameter according to the target time series data comprises:

3. The fault monitoring method of claim 2, wherein the classification model is obtained by training, the training comprising:

4. The method of claim 3, wherein determining the sample label for each sample timing data in the set of sample timing data based on the expert annotation label comprises:

5. The method of claim 4, wherein determining the sample label for each sample timing data in the set of sample timing data according to the expert annotation label further comprises:

6. The fault monitoring method of claim 2, wherein the initial stationarity feature comprises at least one of:

7. The fault monitoring method according to any one of claims 1 to 6, wherein the preset parameters include at least one of:

traffic, packet loss rate, throughput of the database.

8. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-7 when executing the computer program.

9. A computer-readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-7.