CN113312244A

CN113312244A - Fault monitoring method, equipment, program product and storage medium

Info

Publication number: CN113312244A
Application number: CN202110858765.9A
Authority: CN
Inventors: 柏健; 肖雄; 吕彪; 刘昊俣
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2021-08-27

Abstract

The embodiment of the application provides a fault monitoring method, equipment, a program product and a storage medium. In the embodiment of the application, the running state characteristics corresponding to each monitoring object can be automatically extracted in the process of fault monitoring of each monitoring object in the cloud network, and the monitoring basis adapted to each monitoring object can be determined based on the running state characteristics, so that the corresponding monitoring basis can be dynamically and adaptively updated along with the continuous change of the running state of each monitoring object, and the continuously updated monitoring basis can be better adapted to the complex and changeable running process of the cloud network, thereby effectively improving the accuracy of fault monitoring of each monitoring object in the cloud network, reducing the fault alarm noise and preventing the false alarm.

Description

Fault monitoring method, equipment, program product and storage medium

Technical Field

The present application relates to the field of cloud network technologies, and in particular, to a fault monitoring method, device, program product, and storage medium.

Background

In the operation process of the cloud network, fault monitoring is generally required to be performed on a cluster in the cloud network, and component faults are discovered and repaired in time, so that influences on cloud network users are avoided.

In the current fault monitoring scheme, an empirical value of a flow index is set manually, and in the fault monitoring process, the empirical value is used as a basis for judging whether the flow of a cluster is abnormal or not, so that the cluster fault is found based on the flow abnormality. However, the operation process of the cloud network is complex and variable, and the empirical value is too rigid, so that misjudgment often occurs, a large amount of fault alarm noise is generated, and extra burden is brought to the maintenance work of the cloud network.

Disclosure of Invention

Aspects of the present application provide a fault monitoring method, device, program product, and storage medium to improve accuracy of fault monitoring in a cloud network.

The embodiment of the application provides a fault monitoring method, which comprises the following steps:

responding to the monitoring basis updating instruction, and extracting the running state characteristics corresponding to the target monitoring object in the cloud network;

inputting the running state characteristics corresponding to the target monitoring object into a monitoring basis determination model;

in the monitoring basis determination model, determining a target monitoring basis adapted to the running state characteristics corresponding to the target monitoring object based on the mapping relation between the running state characteristics and the monitoring basis;

and updating the monitoring basis corresponding to the target monitoring object as the target monitoring basis.

The embodiment of the application also provides a computing device, which comprises a memory and a processor;

the memory is to store one or more computer instructions;

the processor is coupled with the memory for executing the one or more computer instructions for:

Embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the aforementioned fault monitoring method.

In the embodiment of the application, the running state characteristics corresponding to each monitoring object can be automatically extracted in the process of fault monitoring of each monitoring object in the cloud network, and the monitoring basis adapted to each monitoring object can be determined based on the running state characteristics, so that the corresponding monitoring basis can be dynamically and adaptively updated along with the continuous change of the running state of each monitoring object, and the continuously updated monitoring basis can be better adapted to the complex and changeable running process of the cloud network, thereby effectively improving the accuracy of fault monitoring of each monitoring object in the cloud network, reducing the fault alarm noise and preventing the false alarm.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flow chart of a fault monitoring method according to an exemplary embodiment of the present application;

FIG. 2 is a logic diagram of a fault monitoring scheme provided by an exemplary embodiment of the present application;

FIG. 3 is a schematic structural diagram of a fault monitoring system according to another exemplary embodiment of the present application;

FIG. 4 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, fault monitoring of a cloud network usually relies on a manually set empirical value of a flow index as the reference for judging whether flow is abnormal, and faults are found on that basis. However, the empirical value is too rigid, so misjudgment often occurs, a large amount of fault alarm noise is generated, and extra burden is placed on the maintenance work of the cloud network. To this end, in some embodiments of the present application, the running state characteristics corresponding to each monitoring object can be automatically extracted during fault monitoring of each monitoring object in the cloud network, and the monitoring basis adapted to each monitoring object can be determined based on those running state characteristics. The monitoring basis can thus be dynamically and adaptively updated as the running state of each monitoring object changes, and the continuously updated monitoring basis is better adapted to the complex and changeable running process of the cloud network, which effectively improves the accuracy of fault monitoring of each monitoring object in the cloud network.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of a fault monitoring method according to an exemplary embodiment of the present application, and fig. 2 is a schematic logic diagram of a fault monitoring scheme according to an exemplary embodiment of the present application. Wherein the method may be performed by a fault monitoring device, which may be implemented as a combination of software and/or hardware, which may be integrated in a computing device. In this embodiment, the computing device may be a cloud server, a node, a container group, or the like in a cloud network, the computing device has a monitoring right for each monitored object in the cloud network, and the computing device may acquire the operating state data of each monitored object.

Based on this, referring to fig. 1, the fault monitoring method provided in this embodiment may include:

Step 100, in response to a monitoring basis update instruction, extracting the running state characteristics corresponding to a target monitoring object in the cloud network;

Step 101, inputting the running state characteristics corresponding to the target monitoring object into a monitoring basis determination model;

Step 102, in the monitoring basis determination model, determining a target monitoring basis adapted to the running state characteristics corresponding to the target monitoring object based on the mapping relation between running state characteristics and monitoring bases;

Step 103, updating the monitoring basis corresponding to the target monitoring object to the target monitoring basis.
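The four steps above can be read as a single update routine. The following is a minimal Python sketch under stated assumptions: the function name `update_monitoring_basis`, the feature names, and the dictionary layout of a monitoring basis are hypothetical and chosen only for illustration, and the feature extractor and the monitoring basis determination model are passed in as plain callables rather than real machine learning models.

```python
from typing import Any, Callable, Dict

Features = Dict[str, Any]   # running state characteristics (hypothetical layout)
Basis = Dict[str, Any]      # monitoring basis: strategies and/or parameters


def update_monitoring_basis(
    target: str,
    extract_features: Callable[[str], Features],
    basis_model: Callable[[Features], Basis],
    basis_store: Dict[str, Basis],
) -> Basis:
    """Run steps 100 to 103 once for one target monitoring object."""
    features = extract_features(target)    # step 100: extract characteristics
    target_basis = basis_model(features)   # steps 101-102: model maps them to a basis
    basis_store[target] = target_basis     # step 103: replace the stored basis
    return target_basis


# Toy stand-ins, for illustration only (not real extractors or models).
def toy_extract(target: str) -> Features:
    return {"flow_stability": "stable", "change_state": "unchanged"}


def toy_model(features: Features) -> Basis:
    if features["flow_stability"] == "stable":
        return {"strategies": ["traffic_mutation"], "params": {"score_threshold": 3.0}}
    return {"strategies": ["traffic_mutation"], "params": {"min_flow": 0.0}}


store: Dict[str, Basis] = {}
update_monitoring_basis("cluster-a", toy_extract, toy_model, store)
```

Because the basis is stored per target, each monitoring object can be updated on its own schedule or all together.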

The fault monitoring method provided by this embodiment can be applied to fault monitoring scenarios of a cloud network. The target monitoring object in this embodiment may be a cluster, a component, a node, a server, a user instance, or the like deployed in the cloud network, and this embodiment does not limit the scale, type, or other attributes of the target monitoring object. The method can be used to monitor all monitoring objects in the cloud network for faults and to discover and locate faults in the cloud network in time. For convenience of description, the fault monitoring scheme is described below taking one target monitoring object in the cloud network as an example, but it should be understood that the target monitoring object may be any one of the monitoring objects included in the cloud network.

Referring to fig. 1 and 2, in step 100, an update of the monitoring basis of the target monitoring object may be initiated in response to the monitoring basis update instruction. In practical applications, the update instruction may be issued on a regular schedule, for example, periodically or at specified discrete time points; of course, the present embodiment is not limited thereto, and the update instruction may also be issued at irregular times. In addition, in this embodiment, a monitoring basis update instruction may be issued for all monitoring objects in the cloud network synchronously, so that the monitoring basis of every monitoring object is updated synchronously; alternatively, monitoring basis update instructions may be issued for different monitoring objects individually, in which case the monitoring basis update process of each monitoring object is relatively independent.

In this embodiment, the running state characteristics corresponding to the target monitoring object in the cloud network may be extracted. The running state characteristics are used to represent the running state of the target monitoring object. Because that running state changes dynamically, the characteristics extracted for the same target monitoring object may differ from one monitoring basis update to the next. In this embodiment, the running state characteristics may include, but are not limited to, a flow stability characteristic, a change state characteristic, a user distribution characteristic, a single-user flow state characteristic, a flow fluctuation rate, a transmission delay characteristic, a future flow prediction characteristic, and the like. The flow stability characteristic may be used to represent whether the flow change on the target monitoring object is stable. The change state characteristic may be used to represent whether the target monitoring object is undergoing a change; in this embodiment, a change platform may be connected to obtain change information of the target monitoring object, from which the change state characteristic is generated. The user distribution characteristic may be used to represent attributes such as the number and types of users carried by the target monitoring object. The single-user flow state characteristic may be used to represent the flow change of an individual user carried by the target monitoring object, for example, whether a sudden increase or a sudden decrease has occurred. The flow fluctuation rate may be used to represent the degree of flow fluctuation of the target monitoring object. The transmission delay characteristic may be used to represent the delay condition of the transmission channel where the target monitoring object is located. The future flow prediction characteristic may be used to represent the flow of the target monitoring object over a future period of time.

It should be noted that, in the present embodiment, the running state characteristics are not limited to the above-mentioned exemplary characteristics; more characteristics may be added freely to continuously optimize the basis on which the monitoring basis is determined.

In this embodiment, machine learning techniques may be introduced, and machine learning models may be used to extract the running state characteristics corresponding to the target monitoring object. In practical applications, different running state characteristics may be extracted by different machine learning models. The machine learning models for extracting different running state characteristics may learn their model parameters in advance through supervised learning; the training samples may come from historical monitoring processes, and the learning process may adopt any machine learning scheme existing now or developed in the future, which is not detailed herein.

Taking a target feature in the running state features corresponding to the target monitoring object as an example, in this embodiment, the running state data of the target monitoring object may be obtained; determining target data having influence on the target characteristics in the running state data; inputting the target data into a specific machine learning model for extracting target features; target features are extracted based on the target data using a particular machine learning model. The target characteristic may be any one of the operating state characteristics corresponding to the target monitoring object.
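The select-then-extract flow described above might be organized as follows. This is a sketch only: the field names, the `RELEVANT_FIELDS` table, and the toy extractor standing in for a trained machine learning model (here a simple (max - min) / mean ratio) are all assumptions that do not come from the embodiment.

```python
from typing import Any, Callable, Dict, List

StateData = Dict[str, Any]

# Hypothetical table: which running state data fields influence which feature.
RELEVANT_FIELDS: Dict[str, List[str]] = {
    "flow_fluctuation_rate": ["flow_trend"],
}


def fluctuation_rate(target_data: StateData) -> float:
    """Toy extractor used in place of a trained model: (max - min) / mean."""
    flow = target_data["flow_trend"]
    mean = sum(flow) / len(flow)
    return (max(flow) - min(flow)) / mean if mean else 0.0


# Hypothetical registry: one dedicated extractor per target feature.
EXTRACTORS: Dict[str, Callable[[StateData], Any]] = {
    "flow_fluctuation_rate": fluctuation_rate,
}


def extract_target_feature(name: str, state_data: StateData) -> Any:
    # Keep only the running state data that influences the target feature.
    target_data = {key: state_data[key] for key in RELEVANT_FIELDS[name]}
    # Hand that data to the extractor dedicated to this feature.
    return EXTRACTORS[name](target_data)


state = {"flow_trend": [90.0, 100.0, 110.0], "user_count": 42}
rate = extract_target_feature("flow_fluctuation_rate", state)
```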

For example, for the flow stability characteristic, flow trend data of the target monitoring object may be obtained and input into the specific machine learning model for extracting the flow stability characteristic, in which the flow stability characteristic of the target monitoring object may be predicted based on a Pearson correlation coefficient. For example, if the Pearson correlation coefficient corresponding to the target monitoring object is greater than a threshold (e.g., 0.99), the flow stability characteristic of the target monitoring object may be determined to be "stable"; otherwise, if the Pearson correlation coefficient is less than the threshold, the flow stability characteristic may be determined to be "unstable".
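As an illustration of the threshold test, the sketch below computes a Pearson correlation coefficient in plain Python and maps it to "stable" or "unstable". The embodiment does not say which two series are correlated; comparing the current flow trend with a reference trend (for example, the same window from an earlier period) is an assumption made here, as are the function names. A coefficient exactly equal to the threshold is treated as "unstable" in this sketch.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Coefficient is undefined for a constant series; this sketch
        # treats it as "no correlation" (an assumption).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flow_stability(current: Sequence[float],
                   reference: Sequence[float],
                   threshold: float = 0.99) -> str:
    # "stable" only when the coefficient is strictly above the threshold.
    return "stable" if pearson(current, reference) > threshold else "unstable"
```

For instance, a current trend of 10, 20, 30, 40, 50 against a reference of 11, 21, 31, 41, 51 is perfectly correlated and would be labeled "stable".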

In practical applications, some running state characteristics can be determined directly from known running state data without relying on a machine learning model. For example, in this embodiment, for the change state characteristic, a change platform may be connected and the change information of the target monitoring object may be acquired from it, so that whether the change state characteristic of the target monitoring object is "changed" or "unchanged" can be determined directly from the change information.

On this basis, in step 101, the running state characteristics corresponding to the target monitoring object may be input into the monitoring basis determination model. As mentioned above, there may be one or more running state characteristics corresponding to the target monitoring object, and they may be input into the monitoring basis determination model together. The monitoring basis determination model can determine the monitoring basis that the target monitoring object should currently adopt. In this embodiment, the monitoring basis determination model may be a machine learning model whose task is to output the monitoring basis. It may adopt a machine learning model for solving classification problems, such as LSTM, CNN, BERT, etc., which is not limited in this embodiment.

In step 102, a target monitoring basis adapted to the operating state feature corresponding to the target monitoring object may be determined in the monitoring basis determination model based on the mapping relationship between the operating state feature and the monitoring basis. The monitoring basis determination model can learn the mapping relation between the running state characteristics and the monitoring basis. In this embodiment, the monitoring basis determination model may be trained in a supervised learning manner. The training process may be:

acquiring a plurality of monitoring samples, wherein the monitoring samples comprise running state characteristics corresponding to monitoring objects and adopted monitoring basis;

and inputting the running state characteristics contained in the monitoring samples and the adopted monitoring basis into a monitoring basis determination model so as to enable the monitoring basis determination model to learn the mapping relation between the running state characteristics and the monitoring basis.
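To make the supervised mapping concrete, the sketch below trains a deliberately trivial stand-in: a lookup table that remembers, for each combination of running state characteristics, the monitoring basis most often paired with it in the monitoring samples. This is not one of the classifiers named in the embodiment (LSTM, CNN, BERT); the class name, feature names, and basis labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Hashable, Iterable, Optional, Tuple

Features = Dict[str, Hashable]
Sample = Tuple[Features, str]   # (running state characteristics, adopted basis)


class LookupBasisModel:
    """Majority-vote lookup table standing in for a real classifier."""

    def __init__(self) -> None:
        self._votes: Dict[Any, Counter] = defaultdict(Counter)

    @staticmethod
    def _key(features: Features) -> Tuple:
        # Order-independent, hashable representation of the feature set.
        return tuple(sorted(features.items()))

    def fit(self, samples: Iterable[Sample]) -> None:
        # Count how often each basis was adopted for each feature combination.
        for features, basis in samples:
            self._votes[self._key(features)][basis] += 1

    def predict(self, features: Features, default: Optional[str] = None) -> Optional[str]:
        votes = self._votes.get(self._key(features))
        if not votes:
            return default          # unseen feature combination
        return votes.most_common(1)[0][0]
```

Calling `fit` again with newly labeled samples adds to the existing counts, which loosely mirrors continued training during the period in which the model is in use.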

Accordingly, in the embodiment, the machine learning technology and the operation and maintenance monitoring field can be effectively combined, so that the capability of adaptively adjusting monitoring bases for different target monitoring objects is continuously optimized in a self-learning mode, and the accuracy of fault monitoring is further improved.

For the operation state features corresponding to the monitoring objects contained in the monitoring samples, the aforementioned feature extraction scheme may be adopted for feature extraction.

In this embodiment, the monitoring samples may be derived from a plurality of historical monitoring processes that occurred before the monitoring basis determination model was put into use. For example, the running state characteristics corresponding to the monitoring object and the adopted monitoring basis may be extracted from the monitoring records generated in those historical monitoring processes to construct the monitoring samples. The monitoring basis contained in a monitoring sample may be labeled manually or by similar means. Alternatively, historical monitoring processes with correct monitoring results may be screened out from the large number of historical monitoring processes and used as monitoring samples; in this case, the monitoring basis in the monitoring sample may directly reuse the actual monitoring basis of the corresponding historical monitoring process.

In this embodiment, the monitoring samples may also be derived from new monitoring processes that occur while the monitoring basis determination model is in use. Similarly, new monitoring samples may be continuously labeled from these new monitoring processes and used to continuously optimize the model parameters of the monitoring basis determination model. Through such continuous training, the model parameters are continuously optimized, and the mapping relation between the running state characteristics and the monitoring basis in the model is continuously updated during the period in which the model is in use.

In addition, in this embodiment, the set of monitoring bases covered by the monitoring basis determination model can be freely expanded. Monitoring records may be labeled to obtain monitoring samples related to a new monitoring basis, and these new monitoring samples may be used to further train the monitoring basis determination model so that it learns the mapping relation between the running state characteristics and the new monitoring basis. In addition, when a new monitoring basis is added, the previous monitoring samples (corresponding to the above-mentioned historical monitoring processes) may be used to perform replay verification on the current update of the monitoring basis determination model. That is, the updated model parameters are used to predict the monitoring basis for the previous monitoring samples; if the prediction results are consistent with the monitoring basis contained in those samples, the current update of the monitoring basis determination model is allowed, and otherwise it is not allowed. This helps ensure the stability of the monitoring basis determination model.
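Replay verification can be pictured as a gate in front of each model update. In the sketch below, models are plain callables from characteristics to a basis label, and "consistent" is read strictly as "every previous sample is predicted as before"; both the strict reading and the names `replay_passes` and `apply_update` are assumptions, since the embodiment does not fix a tolerance.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Features = Dict[str, Any]
Model = Callable[[Features], str]
Sample = Tuple[Features, str]   # (running state characteristics, basis label)


def replay_passes(candidate: Model, previous_samples: Sequence[Sample]) -> bool:
    """True if the candidate reproduces the basis of every previous sample."""
    return all(candidate(features) == basis for features, basis in previous_samples)


def apply_update(current: Model, candidate: Model,
                 previous_samples: Sequence[Sample]) -> Model:
    # Accept the updated model only if replay verification succeeds;
    # otherwise keep serving the current model.
    return candidate if replay_passes(candidate, previous_samples) else current
```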

In this embodiment, the monitoring criteria may include: monitoring parameters and/or monitoring strategies. Thus, the target monitoring criterion determined for the target monitoring object may comprise at least one target monitoring parameter and/or at least one monitoring strategy. The monitoring parameters may include, but are not limited to, a traffic monitoring threshold, a transmission delay time threshold, a number of users threshold, a traffic fluctuation rate threshold, an abnormal state duration threshold, and the like. The monitoring strategy may include, but is not limited to, a traffic mutation sensing strategy, a statistical analysis strategy, a transmission delay event merging strategy, an index association strategy, a persistence sensing strategy, or a fitting distribution strategy. In this embodiment, a policy center may be configured, and logic contents of each monitoring policy are preset in the policy center. The monitoring strategies in this embodiment are not limited to the above exemplary strategies, but more monitoring strategies may be added continuously to monitor the faults in the cloud network in a more multidimensional manner.

In this embodiment, some of the monitoring strategies are based on monitoring parameters, that is, their strategy logic needs to be run on the monitoring parameters. The monitoring parameters that such a monitoring strategy refers to may be configured in advance. On this basis, taking a first monitoring strategy contained in the target monitoring basis determined for the target monitoring object as an example, in this embodiment, a first monitoring parameter corresponding to the first monitoring strategy may be determined from the at least one target monitoring parameter contained in the target monitoring basis; and, under the first monitoring strategy, fault monitoring is performed on the target monitoring object based on the first monitoring parameter. The first monitoring strategy is any one of the at least one monitoring strategy contained in the target monitoring basis. For example, for the flow mutation perception strategy mentioned above, whether a sudden flow change occurs may be determined based on the flow monitoring threshold. For another example, for the statistical analysis strategy mentioned above, whether the number of users with a sudden flow change, among the users carried by the target monitoring object, exceeds the user number threshold may be determined based on the user number threshold.
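One way to picture the pairing of a parameter-based strategy with its monitoring parameter is a small registry, loosely modeled on the policy center mentioned earlier. Everything in this sketch is hypothetical: the registry layout, the parameter names, and the simplified strategy logic (a score compared against the flow monitoring threshold by absolute value, and a count of affected users compared against the user number threshold).

```python
from typing import Callable, Dict, Tuple

PolicyLogic = Callable[[float, float], bool]   # (observed value, parameter) -> abnormal?

# Hypothetical policy center: strategy name -> (parameter it needs, its logic).
POLICY_CENTER: Dict[str, Tuple[str, PolicyLogic]] = {
    "traffic_mutation": (
        "flow_monitoring_threshold",
        lambda score, threshold: abs(score) > threshold,
    ),
    "statistical_analysis": (
        "user_number_threshold",
        lambda mutated_users, threshold: mutated_users > threshold,
    ),
}


def run_policy(name: str, target_params: Dict[str, float], observed: float) -> bool:
    """Look up the parameter the strategy refers to, then run its logic."""
    param_name, logic = POLICY_CENTER[name]
    first_param = target_params[param_name]   # parameter matched to this strategy
    return logic(observed, first_param)
```

With target parameters of 3.0 for the flow monitoring threshold and 1 for the user number threshold, an observed score of -4.2 would be flagged by the first strategy, while a single affected user would not be flagged by the second.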

Taking the flow mutation perception strategy as an example, in this embodiment, when the flow stability characteristic of the target monitoring object is "stable", the standard score may be adopted as the flow monitoring threshold; when the flow stability characteristic is "unstable", a bottom-line monitoring threshold, namely the flow falling to 0, may be retained as the flow monitoring threshold. The flow mutation perception strategy based on the standard score may be as follows:

The upward or downward trend of the flow at the current time point relative to the flow trend over a preceding time period (for example, the previous 5 time points) is calculated using the standard score (standardscore) formula. The standardscore formula may be: standardscore = ($cur - $avg) / $stddev, where $cur is the flow trend representation value at the current time point, $avg is the mean of the flow trend representation values of all time points in the time period, and $stddev is the standard deviation of the flow trend representation values of all time points in the time period. Based on this, a standardscore of 0 indicates that the flow trend is stable, a value greater than 0 indicates an upward flow trend, and a value less than 0 indicates a downward flow trend. Based on practical experience, if the calculated value is greater than 3 (other thresholds may of course be used), it may be determined that a sudden flow increase has occurred on the target monitoring object; if the calculated value is less than -3 (again, other thresholds are possible), it may be determined that a sudden flow drop has occurred on the target monitoring object.
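The standard score check can be sketched in a few lines of Python. Two points are assumptions not fixed by the text: the population standard deviation is used for $stddev, and a flat window (standard deviation of 0) yields 0 when the current value equals the mean and plus or minus infinity otherwise. The function names are illustrative.

```python
import math
import statistics
from typing import Sequence


def standard_score(cur: float, window: Sequence[float]) -> float:
    """(cur - avg) / stddev over the preceding time points in `window`."""
    avg = statistics.fmean(window)
    stddev = statistics.pstdev(window)   # population std dev (assumption)
    if stddev == 0:
        # Flat window: the formula is undefined, so this sketch returns 0
        # for no deviation and +/- infinity for any deviation (assumption).
        return 0.0 if cur == avg else math.copysign(math.inf, cur - avg)
    return (cur - avg) / stddev


def classify_flow(cur: float, window: Sequence[float], threshold: float = 3.0) -> str:
    score = standard_score(cur, window)
    if score > threshold:
        return "sudden_increase"
    if score < -threshold:
        return "sudden_drop"
    return "normal"
```

With the previous 5 time points at 10, 12, 8, 10, 10 (mean 10, standard deviation about 1.26), a current value of 20 scores about 7.9 and is classified as a sudden increase, while a current value of 10 scores 0.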

In this embodiment, some monitoring strategies do not depend on monitoring parameters, but can directly perform fault monitoring based on known information. For example, for the transmission delay event merging strategy in the foregoing, if there is a transmission delay occurring in the area a in the cloud network, and the area a is a transmission node in the transmission link where the target monitoring object is located, the transmission delay event in the area a may be merged onto the transmission link where the target monitoring object is located, so as to determine that there is a transmission delay in the transmission link where the target monitoring object is located.
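The merging rule needs no threshold, only topology. A minimal sketch, assuming each transmission link is known as a list of the areas (transmission nodes) it passes through; the identifiers and the function name `merge_delay_events` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def merge_delay_events(delayed_areas: Iterable[str],
                       links: Dict[str, List[str]]) -> Set[str]:
    """Return the IDs of transmission links that pass through a delayed area."""
    delayed = set(delayed_areas)
    # A link inherits the delay event if any of its nodes is delayed.
    return {link_id for link_id, nodes in links.items()
            if delayed.intersection(nodes)}
```

If area A is delayed and a link runs through areas A and B, that link is reported as delayed, and so is any target monitoring object located on it.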

Accordingly, the monitoring basis determination model can be used to determine a target monitoring basis matched to the target monitoring object, and the target monitoring basis may comprise at least one target monitoring parameter and/or at least one monitoring strategy.

Based on this, in step 103, the monitoring basis corresponding to the target monitoring object may be updated to the target monitoring basis. Thereafter, in this embodiment, fault monitoring of the target monitoring object may be performed according to the target monitoring basis in response to a fault monitoring instruction for the target monitoring object.

As mentioned above, the target monitoring basis determined for the target monitoring object may include one or more monitoring strategies, and in the case that there are a plurality of monitoring strategies included in the target monitoring basis, in this embodiment, fault monitoring may be performed under a plurality of monitoring strategies, respectively; and if the fault results monitored under the multiple monitoring strategies meet the preset requirements, determining that the target monitoring object has a fault. That is, in this case, a plurality of monitoring strategies included in the target monitoring basis may be fused and used as a fault judgment strategy.

For example, if the target monitoring basis includes a traffic sudden change sensing strategy and a statistical analysis strategy, whether the traffic of the target monitoring object has a sudden change may be determined based on the traffic sudden change sensing strategy. If it has, whether the sudden change is caused by a single user or by a plurality of users may be determined based on the statistical analysis strategy. If the sudden change is caused by a plurality of users, it may be determined that the target monitoring object has a fault. If the sudden change is caused by a single user rather than a plurality of users, it is determined that the target monitoring object has not failed: the sudden change most likely reflects a change in that single user's own traffic rather than a fault of the target monitoring object.

In this example, the target monitoring object is subjected to fault monitoring in two monitoring strategies respectively, and the detection results under the two monitoring strategies meet the preset requirements, so that the conclusion that the target monitoring object is in fault is obtained. It should be understood that the logic of the traffic mutation perception policy and the statistical analysis policy is merely exemplary, and the present embodiment is not limited thereto.

Optionally, in the application process of the monitoring policy, if it is determined that the target monitoring object meets the corresponding fault standard under the multiple monitoring policies, it is determined that the target monitoring object fails. That is, if it is determined that the target monitoring object has a fault under various monitoring strategies, a conclusion that the target monitoring object has a fault indeed can be obtained, otherwise it is determined that the target monitoring object has not a fault but has a misjudgment under some monitoring strategies.
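The two fusion patterns just described, the sequential two-strategy example and the "all strategies agree" rule, can be sketched as follows. The function names are illustrative, and treating "more users affected than the user number threshold" as the multi-user case is an assumption about how the statistical analysis strategy would be parameterized.

```python
from typing import Dict


def sequential_verdict(flow_mutated: bool, mutated_users: int,
                       user_number_threshold: int) -> bool:
    """Traffic mutation check first, then statistical analysis on the users."""
    if not flow_mutated:
        return False   # no sudden change, so nothing to attribute
    # Fault only if the change spans more users than the threshold allows.
    return mutated_users > user_number_threshold


def unanimous_verdict(results: Dict[str, bool]) -> bool:
    """Fault only if every applied strategy reports its fault criterion met."""
    return bool(results) and all(results.values())
```

With a user number threshold of 1, a sudden change traced to a single user is not reported as a fault, while the same change spread over five users is.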

In addition, in this embodiment, a monitoring interface may be provided, and the fault monitoring result may be output in the monitoring interface after fault monitoring is completed. Fault alarm information may also be output when a fault occurs on the target monitoring object, for example, as a popup window on the monitoring interface, a buzzer, a voice broadcast, or the like. As described above, the fault monitoring scheme of this embodiment can locate faults in the cloud network, so the fault occurrence position may be indicated in the fault alarm information. On this basis, in response to a fault reporting operation performed in the monitoring interface for the fault occurrence position, a fault handling instruction may be sent to the maintainer responsible for that position, prompting the maintainer to resolve the fault in time. In addition, when a fault occurs on the target monitoring object, a fault monitoring record may be output in the monitoring interface. The fault monitoring record may include the monitoring parameters and monitoring strategies used in the current monitoring process and the running state data of the target monitoring object, and it may be checked manually. On this basis, the target monitoring object may be determined to have failed in response to a confirmation operation on the fault monitoring record in the monitoring interface. Thereafter, the above-described post-failure handling scheme may continue to be performed.

In summary, in this embodiment, in the process of monitoring a fault of each monitored object in the cloud network, the operation state feature corresponding to each monitored object may be automatically extracted, and the monitoring basis adapted to each monitored object may be determined based on the operation state feature, so that the corresponding monitoring basis may be dynamically and adaptively updated according to the continuous change of the operation state of each monitored object, and the continuously updated monitoring basis may be better adapted to the complex and variable operation process of the cloud network, thereby effectively improving the accuracy of monitoring a fault of each monitored object in the cloud network.

The following describes the fault monitoring scheme provided in this embodiment by taking fault monitoring of a cluster A in a cloud network as an example.

According to the traditional scheme, a worker can set a fixed flow threshold value for the cluster A, and when the fact that the actual flow on the cluster A exceeds the flow threshold value is monitored, the cluster A is considered to have a fault. However, in practical application, fault misjudgment often occurs, and manpower and material resources are wasted.

According to the scheme provided by this embodiment, the operating state data of cluster A, including but not limited to traffic trend data, change information, the number of users, and the traffic of a single user, may be obtained periodically, and the operating state features corresponding to cluster A may be generated from these data. As previously mentioned, the operating state features generated for cluster A may include, but are not limited to, traffic stability features, change state features, user distribution features, single-user traffic state features, traffic fluctuation rates, transmission delay features, and future traffic prediction features. Together, the operating state features reflect the current actual operating state of cluster A from multiple dimensions.

The operating state features of cluster A may then be input into the monitoring basis determination model, which outputs a monitoring basis according to the operating state features.

For example, based on the operating state features of cluster A, the monitoring basis determination model may output two monitoring thresholds, namely a traffic threshold and a user number threshold, and hit two monitoring strategies, namely a traffic mutation perception strategy and a statistical analysis strategy. Fault monitoring may then be performed on cluster A according to the monitoring basis determined by the model. For example, whether the traffic of cluster A has changed suddenly may be determined based on the traffic mutation perception strategy and the traffic threshold; if so, whether the sudden change is caused by an increase in the number of users may be determined based on the statistical analysis strategy and the user number threshold; and if not, it may be determined that cluster A has a fault.
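The two-stage decision in the cluster A example can be sketched roughly as follows. This is only a minimal illustration: the class and function names, the field names, and the exact form of each comparison (relative traffic change against the traffic threshold, relative user growth against the user number threshold) are assumptions and are not specified in this embodiment.

```python
from dataclasses import dataclass


@dataclass
class MonitoringBasis:
    # Hypothetical container for the two thresholds output by the model.
    traffic_change_threshold: float  # largest tolerated relative traffic change
    user_growth_threshold: float     # relative user growth that explains a surge


def cluster_has_fault(prev_traffic: float, cur_traffic: float,
                      prev_users: int, cur_users: int,
                      basis: MonitoringBasis) -> bool:
    """Two-stage check: sudden traffic change first, then user-growth explanation."""
    # Stage 1 (assumed form of the traffic mutation perception strategy):
    # compare the relative traffic change with the dynamic threshold.
    change = abs(cur_traffic - prev_traffic) / max(prev_traffic, 1e-9)
    if change <= basis.traffic_change_threshold:
        return False  # no sudden change, nothing to report

    # Stage 2 (assumed form of the statistical analysis strategy):
    # a surge accompanied by enough user growth is treated as a non-fault cause.
    user_growth = (cur_users - prev_users) / max(prev_users, 1)
    if cur_traffic > prev_traffic and user_growth >= basis.user_growth_threshold:
        return False

    return True  # sudden change that user growth does not explain
```

With thresholds of 0.5 and 0.3, for instance, traffic doubling while the user count grows from 50 to 80 would not be reported, whereas the same doubling with the user count nearly unchanged would be reported as a fault.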

Compared with the traditional scheme, in the scheme provided by this embodiment the traffic threshold is determined dynamically according to the actual operating state of cluster A, which provides a more accurate basis for fault monitoring. Moreover, the scheme does not blindly conclude that cluster A has a fault as soon as its traffic changes suddenly; instead, it goes on to determine whether the sudden change is caused by a non-fault factor, and concludes that cluster A has a fault only after non-fault factors have been ruled out, which can effectively reduce fault misjudgment for cluster A.

It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subject of steps 101 to 103 may be device A; for another example, the execution subject of steps 101 and 102 may be device A, and the execution subject of step 103 may be device B; and so on.

In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 101, 102, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.

Fig. 3 is a schematic structural diagram of a fault monitoring system according to another exemplary embodiment of the present application. Referring to fig. 3, the fault monitoring system may include:

the feature extraction module 10 is configured to extract an operation state feature corresponding to a target monitoring object in the cloud network in response to the monitoring basis update instruction;

the determining module 20 is configured to input the operating state characteristics corresponding to the target monitoring object into the monitoring basis determining model; in the monitoring basis determination model, determining a target monitoring basis adapted to the running state characteristics corresponding to the target monitoring object based on the mapping relation between the running state characteristics and the monitoring basis;

and the processing module 30 is configured to perform fault monitoring on the target monitoring object according to the target monitoring basis.

Referring to fig. 3, the feature extraction module 10 may include a training unit 11, where the training unit 11 may be configured to train the monitoring basis determination model 21 and each machine learning model mentioned in the foregoing method embodiments for extracting the operating state features, so as to continuously optimize each model. In addition, the feature extraction module 10 may interface with a change platform, and the change platform is configured to record change information of each monitoring object in the cloud network, so that the feature extraction module 10 may directly obtain the change information of the target monitoring object from the change platform to extract the change state feature of the target monitoring object.

In addition, referring to fig. 3, a policy center 40 may be further included in the fault monitoring system, at least one monitoring policy may be included in the policy center 40, and the logic contents of different monitoring policies may be pre-configured in the policy center. After the determination module 20 determines the target monitoring basis for the target monitoring object, the processing module 30 may update the monitoring basis corresponding to the target monitoring object to the target monitoring basis.

In this embodiment, the target monitoring basis includes at least one target monitoring parameter and/or at least one monitoring strategy. The monitoring parameters include one or more of a traffic monitoring threshold, a transmission delay time threshold, a number of users threshold, a traffic fluctuation rate threshold, and an abnormal state duration threshold. The operating state features include one or more of a traffic stability feature, a change state feature, a user distribution feature, a single-user traffic state feature, a traffic fluctuation rate, a transmission delay feature, and a future traffic prediction feature.

In an optional embodiment, referring to fig. 3, the fault monitoring system may further include an execution module 50, configured to perform fault monitoring on the target monitoring object according to a target monitoring basis in response to a fault monitoring instruction for the target monitoring object. In the process of monitoring the target monitoring object for faults, the execution module 50 may invoke one or more monitoring strategies related to the target monitoring basis from the policy center 40.

When performing fault monitoring on the target monitoring object according to the target monitoring basis, if the target monitoring basis includes multiple monitoring strategies, the execution module 50 may perform fault monitoring under each of the monitoring strategies, and may determine that the target monitoring object has a fault if the fault results monitored under the multiple monitoring strategies meet a preset requirement.

In an optional embodiment, the execution module 50 is specifically configured to determine that the target monitoring object fails if it is determined that the target monitoring object meets the corresponding failure criterion under a plurality of monitoring strategies.

In an optional embodiment, the execution module 50 may be configured to, during the fault monitoring process under a plurality of monitoring strategies:

if the first monitoring strategy is a monitoring strategy based on monitoring parameters, determining first monitoring parameters corresponding to the first monitoring strategy from at least one target monitoring parameter contained in the target monitoring basis;

under a first monitoring strategy, fault monitoring is carried out on a target monitoring object based on a first monitoring parameter;

the first monitoring strategy is any one of at least one monitoring strategy contained in the target monitoring basis.
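The interplay between the policy center, parameter-based strategies, and the combined fault decision might look like the sketch below. It is illustrative only: the `POLICY_CENTER` registry, the strategy names (including the parameter-free `heartbeat` strategy), and the state fields are hypothetical, and the "preset requirement" is assumed here to be that every selected strategy reports a fault.

```python
from typing import Callable, Dict, Mapping, Optional, Sequence, Tuple

# A strategy receives the observed running state and an optional parameter
# and reports whether its own fault criterion is met.
Strategy = Callable[[Mapping[str, float], Optional[float]], bool]

# Hypothetical stand-in for the policy center: strategy name ->
# (strategy logic, name of the monitoring parameter it needs, or None).
POLICY_CENTER: Dict[str, Tuple[Strategy, Optional[str]]] = {
    "traffic_threshold": (
        lambda state, limit: state["traffic"] > limit,
        "traffic_monitoring_threshold",
    ),
    "delay_threshold": (
        lambda state, limit: state["delay_ms"] > limit,
        "transmission_delay_threshold",
    ),
    "heartbeat": (
        lambda state, _unused: state["heartbeat_ok"] == 0,
        None,
    ),
}


def monitor(state: Mapping[str, float],
            strategy_names: Sequence[str],
            target_parameters: Mapping[str, float]) -> bool:
    """Return True only if every selected strategy reports a fault."""
    results = []
    for name in strategy_names:
        logic, param_name = POLICY_CENTER[name]
        # A parameter-based strategy looks up its own parameter in the
        # target monitoring basis; other strategies receive None.
        param = target_parameters[param_name] if param_name else None
        results.append(logic(state, param))
    # Assumed preset requirement: all strategies agree that a fault occurred.
    return bool(results) and all(results)
```

Requiring agreement across strategies trades some sensitivity for fewer false alarms; a different preset requirement (for example a majority vote) would only change the last line.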

In an optional embodiment, when training the monitoring basis determination model, the training unit 11 may obtain a plurality of monitoring samples, where each monitoring sample includes the operating state features corresponding to a monitoring object and the monitoring basis adopted for it, and may input the operating state features and the adopted monitoring basis contained in the monitoring samples into the monitoring basis determination model, so that the model learns the mapping relation between operating state features and monitoring bases.
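As a rough illustration of learning a mapping from operating state features to monitoring bases, the sketch below uses a 1-nearest-neighbor lookup over the monitoring samples. The embodiment does not state which model type is used, so this technique, the class name `NearestSampleModel`, and the numeric feature encoding are all assumptions made purely for illustration.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Features = Sequence[float]
Basis = Dict[str, float]


class NearestSampleModel:
    """Toy stand-in for the monitoring basis determination model (1-nearest neighbor)."""

    def __init__(self) -> None:
        self._samples: List[Tuple[Tuple[float, ...], Basis]] = []

    def fit(self, samples: Iterable[Tuple[Features, Basis]]) -> None:
        # "Training" here simply memorizes (features, adopted basis) pairs.
        self._samples = [(tuple(f), dict(b)) for f, b in samples]

    def predict(self, features: Features) -> Basis:
        if not self._samples:
            raise ValueError("model has not been trained")
        # Return the basis adopted for the most similar known running state.
        _, basis = min(self._samples,
                       key=lambda s: math.dist(s[0], tuple(features)))
        return dict(basis)
```

A model trained on two samples, one for a quiet state and one for a busy state, would return the basis of whichever sample lies closer to the newly observed features.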

In an optional embodiment, when extracting the operating state features corresponding to the target monitoring object in the cloud network, the feature extraction module 10 may: obtain the operating state data of the target monitoring object; determine, in the operating state data, the target data that influences a target feature; input the target data into the specific machine learning model used for extracting the target feature; and extract the target feature from the target data using that model. The target feature is any one of the features included in the operating state features.
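The per-feature extraction flow (select the data that influences a feature, then hand only that data to the extractor for that feature) can be sketched as below. Simple statistics stand in for the specific machine learning models, and the registry `EXTRACTORS`, the feature names, and the data field names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Mapping, Tuple

# Hypothetical registry: target feature -> (fields of the running state
# data that influence it, extractor standing in for the specific model).
EXTRACTORS: Dict[str, Tuple[List[str], Callable[[Mapping], float]]] = {
    "traffic_fluctuation_rate": (
        ["traffic_series"],
        # Coefficient of variation, used here in place of a trained model.
        lambda d: pstdev(d["traffic_series"]) / mean(d["traffic_series"]),
    ),
    "single_user_traffic_share": (
        ["traffic_per_user"],
        # Share of total traffic produced by the heaviest single user.
        lambda d: max(d["traffic_per_user"]) / sum(d["traffic_per_user"]),
    ),
}


def extract_features(running_state_data: Mapping) -> Dict[str, float]:
    """Extract each target feature from only the data that influences it."""
    features = {}
    for name, (fields, extractor) in EXTRACTORS.items():
        # Keep only the target data that has an influence on this feature.
        target_data = {k: running_state_data[k] for k in fields}
        features[name] = extractor(target_data)
    return features
```

Keeping one extractor per feature means a feature's model can be retrained or replaced without touching the others.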

In addition, referring to fig. 3, the fault monitoring system in this embodiment may further include a replay module 60, configured to store monitoring samples corresponding to historical monitoring processes. When the model parameters of the monitoring basis determination model are updated, the replay module 60 may use the previous monitoring samples (corresponding to the historical monitoring processes mentioned above) to perform replay verification on the current update. That is, the updated model parameters are used to predict a monitoring basis for each previous monitoring sample; if the prediction result is consistent with the result recorded in that sample, the current update of the monitoring basis determination model is allowed, and otherwise it is not allowed. This helps ensure the stability of the monitoring basis determination model.
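Replay verification can be pictured as a gate in front of every model update. The sketch below is an assumption-laden illustration: the function names are hypothetical, a model is represented as a plain prediction function, and "consistent" is taken to mean exact equality of the predicted and recorded monitoring basis, whereas a real system might allow a tolerance.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Tuple[float, ...]
Basis = Dict[str, float]
Predict = Callable[[Features], Basis]


def replay_allows_update(candidate_predict: Predict,
                         history: Sequence[Tuple[Features, Basis]]) -> bool:
    """Replay stored samples through the updated model; accept only on full agreement."""
    return all(candidate_predict(features) == recorded
               for features, recorded in history)


def apply_update(current: Predict, candidate: Predict,
                 history: Sequence[Tuple[Features, Basis]]) -> Predict:
    # Keep the current model unless the candidate reproduces every stored result.
    return candidate if replay_allows_update(candidate, history) else current
```

An update that changes the basis predicted for any stored sample is rejected, so the model in service only changes when past decisions remain reproducible.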

Of course, in this embodiment, the fault monitoring system may further include other modules, and is not limited to the above-mentioned several exemplary modules, which are not exhaustive here.

It should be noted that, in the present embodiment, reference may be made to the related description in the foregoing method embodiment for functional logic of each component included in the fault monitoring system, and details are not repeated here, but this should not cause a loss of the protection scope of the present application.

Fig. 4 is a schematic structural diagram of a computing device according to another exemplary embodiment of the present application. Referring to fig. 4, the computing device may include a memory 41 and a processor 42.

A processor 42, coupled to the memory 41, for executing the computer program in the memory 41 to:

In an optional embodiment, the target monitoring basis comprises at least one target monitoring parameter and/or at least one monitoring strategy.

In an optional embodiment, after updating the monitoring basis corresponding to the target monitoring object to the target monitoring basis, the processor 42 is further configured to:

in response to a fault monitoring instruction for the target monitoring object, perform fault monitoring on the target monitoring object according to the target monitoring basis.

In an optional embodiment, when performing fault monitoring on the target monitoring object according to the target monitoring basis, the processor 42 is configured to:

under the condition that the target monitoring basis comprises a plurality of monitoring strategies, fault monitoring is carried out under the plurality of monitoring strategies respectively;

and if the fault results monitored under the multiple monitoring strategies meet the preset requirements, determining that the target monitoring object has a fault.

In an optional embodiment, when determining that the target monitoring object has a fault if the fault results monitored under the multiple monitoring strategies meet the preset requirement, the processor 42 is configured to:

and if the target monitoring object is determined to meet the corresponding fault standard under the multiple monitoring strategies, determining that the target monitoring object has a fault.

In an alternative embodiment, the processor 42 is configured to, when performing fault monitoring under a plurality of monitoring strategies:

In an alternative embodiment, the monitoring parameters include one or more of a traffic monitoring threshold, a transmission delay time threshold, a number of users threshold, a traffic fluctuation rate threshold, and an abnormal state duration threshold.

In an alternative embodiment, the processor 42, in training for the monitoring basis determination model, is configured to:

In an optional embodiment, when extracting the operating state feature corresponding to the target monitoring object in the cloud network, the processor 42 is configured to:

acquiring running state data of a target monitoring object;

determining target data which has influence on target characteristics in the running state data;

inputting the target data into a specific machine learning model for extracting target features;

extracting target features based on the target data by using a specific machine learning model;

the target feature is any one of the features included in the operating state feature.

In an alternative embodiment, the operating state features include one or more of a traffic stability feature, a change state feature, a user distribution feature, a single-user traffic state feature, a traffic fluctuation rate, a transmission delay feature, and a future traffic prediction feature.

Further, as shown in fig. 4, the computing device further includes: communication components 43, power components 44, and the like. Only some of the components are schematically shown in fig. 4, and the computing device is not meant to include only the components shown in fig. 4.

Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computing device in the foregoing method embodiments when executed.

Accordingly, embodiments of the present application also provide a computer program product comprising a computer program/instructions, wherein the computer program, when executed by a processor, causes the processor to implement the steps of the aforementioned fault monitoring method. The computer program product may be monitoring software for fault monitoring or other application software that integrates fault monitoring capabilities.

The memory of FIG. 4, described above, is used to store a computer program and may be configured to store other various data to support operations on a computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The communication component in fig. 4 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply assembly of fig. 4 described above provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A fault monitoring method, comprising:

2. The method according to claim 1, wherein the target monitoring basis comprises at least one target monitoring parameter and/or at least one monitoring strategy.

3. The method according to claim 2, wherein after the monitoring basis corresponding to the target monitoring object is updated to the target monitoring basis, the method further comprises:

and responding to a fault monitoring instruction aiming at the target monitoring object, and carrying out fault monitoring on the target monitoring object according to the target monitoring basis.

4. The method of claim 3, wherein the fault monitoring of the target monitoring object according to the target monitoring basis comprises:

and if the monitored fault results under the multiple monitoring strategies meet the preset requirements, determining that the target monitoring object has a fault.

5. The method according to claim 4, wherein determining that the target monitoring object has a fault if the fault result monitored by the monitoring strategies meets a preset requirement comprises:

and if the target monitoring object is determined to meet the corresponding fault standard under the multiple monitoring strategies, determining that the target monitoring object fails.

6. The method of claim 4, wherein said fault monitoring under said plurality of monitoring strategies, respectively, comprises:

under the first monitoring strategy, fault monitoring is carried out on the target monitoring object based on the first monitoring parameter;

7. The method of claim 2, wherein the monitoring parameters include one or more of a traffic monitoring threshold, a transmission delay time threshold, a number of users threshold, a traffic fluctuation rate threshold, and an abnormal state duration threshold.

8. The method of claim 1, wherein the training process for the monitoring basis determination model comprises:

acquiring a plurality of monitoring samples, wherein the monitoring samples comprise running state characteristics corresponding to monitoring objects and adopted monitoring bases, and the monitoring samples are from a historical monitoring process occurring before the monitoring basis determination model is put into use and from a newly added monitoring process occurring while the model is in use;

and inputting the running state characteristics and the adopted monitoring basis contained in the monitoring samples into the monitoring basis determination model so that the monitoring basis determination model can learn the mapping relation between the running state characteristics and the monitoring basis.

9. The method according to claim 1, wherein the extracting of the operating state feature corresponding to the target monitoring object in the cloud network comprises:

acquiring running state data of the target monitoring object;

determining target data which has influence on target characteristics in the operation state data;

inputting the target data into a specific machine learning model for extracting the target features;

extracting the target feature based on the target data using the particular machine learning model;

wherein the target feature is any one of the features included in the operating state feature.

10. The method of claim 1, wherein the operational status characteristics include one or more of a traffic stability characteristic, a change status characteristic, a user distribution characteristic, a single user traffic status characteristic, a traffic fluctuation rate, a transmission delay characteristic, and a future traffic prediction characteristic.

11. The method of claim 1, wherein the target monitoring object comprises a cluster, component, node, server, or user instance.

12. A computing device comprising a memory and a processor;

the memory is to store one or more computer instructions;

13. A computer-readable storage medium storing computer instructions, which when executed by one or more processors, cause the one or more processors to perform the fault monitoring method of any one of claims 1-11.

14. A computer program product comprising computer programs/instructions, wherein the computer programs, when executed by a processor, cause the processor to carry out the steps of the fault monitoring method according to any one of claims 1-11.