CN115599579A

CN115599579A - System fault prediction method, device, equipment and medium based on weighted loss

Info

Publication number: CN115599579A
Application number: CN202211225563.1A
Authority: CN
Inventors: 王雨农; 夏坤; 邓凌飞; 马旭华
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-10-09
Filing date: 2022-10-09
Publication date: 2023-01-13

Abstract

The embodiments of the application provide a system fault prediction method, apparatus, device and medium based on weighted loss. The method comprises: acquiring a fault prediction model and system data collected in real time, wherein the fault prediction model is trained with a target loss function, and the target loss function is computed from a weighted cross entropy loss and the confidence of the fault prediction model; generating an abnormal sequence from the real-time system data; and obtaining a fault prediction result for the cloud computing system from the abnormal sequence and the fault prediction model. Because the weighted cross entropy loss is generated from the predictions for multiple samples collected over the life cycle of a fault downtime unit, the weight of confusing samples can be reduced, the gap between the training objects and the test objects can be bridged, the negative influence of confusing samples on the model is reduced, and the accuracy of model prediction is improved.

Description

System fault prediction method, device, equipment and medium based on weighted loss

Technical Field

The present disclosure relates to the field of cloud computing technologies, and in particular, to a method and an apparatus for predicting system failure based on weighted loss, an electronic device and a computer-readable storage medium.

Background

A cloud computing system provides cloud computing services while centrally managing, operating, maintaining, upgrading and using its computing servers, which improves the utilization of computing resources and reduces costs for enterprises. For the servers managed by the cloud computing system, the abnormal logs they generate can be collected and used to predict server faults so that operation and maintenance can be performed in time, which improves the stability of the cloud computing system.

However, a cloud computing system contains a large number of servers, and these servers generate a large volume of anomalies. Because the key abnormal patterns of different downtime types occur at different times, labels applied to the data under a single uniform standard may be wrong, so that a large number of confusing samples appear in the training set and the accuracy of model prediction suffers.

Disclosure of Invention

In view of the above problems, embodiments of the present application provide a weighted loss based system failure prediction method, a weighted loss based system failure prediction apparatus, a corresponding electronic device, and a corresponding computer readable storage medium, which overcome or at least partially solve the above problems.

The embodiment of the application discloses a system fault prediction method based on weighted loss, which is applied to a cloud computing system and comprises the following steps:

acquiring a fault prediction model and real-time system data acquired in real time; the fault prediction model is generated based on target loss function training, and the target loss function is obtained based on weighted cross entropy loss and confidence coefficient calculation of the fault prediction model;

generating an abnormal sequence according to the real-time system data acquired in real time;

and obtaining a fault prediction result aiming at the cloud computing system according to the abnormal sequence and the fault prediction model.

Optionally, the real-time system data collected in real time includes an abnormal log reported by each fault downtime unit in the cloud computing system for system abnormality; the generating of the abnormal sequence according to the real-time system data collected in real time comprises:

acquiring the reporting order of the abnormal logs reported by each fault downtime unit, and mapping the abnormal logs reported by each fault downtime unit for system abnormalities to abnormal events, wherein each abnormal event has a corresponding numerical identifier in an abnormal event library, and each abnormal event has a corresponding abnormal grade;

combining and mapping the numerical identifier corresponding to each mapped abnormal event with the abnormal grade corresponding to that abnormal event to obtain an abnormal numerical identifier for each abnormal log;

forming the abnormal numerical identifiers of the abnormal logs into an abnormal identification sequence in the reverse of the reporting order;

and sampling the abnormal identification sequence according to a preset time interval and a preset sampling window length to obtain an abnormal sequence.
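The steps above can be sketched as follows. This is a minimal illustration only: the event names, the grades, the `AbnormalLog` type, and the rule `event_id * number_of_grades + grade` for combining an event identifier with its grade are all assumptions made for the example, since the text does not specify the combination scheme or the window semantics.

```python
from dataclasses import dataclass

# Hypothetical abnormal event library: event name -> numerical identifier.
EVENT_LIBRARY = {"disk_io_error": 0, "memory_ecc_error": 1, "nic_flap": 2}
# Hypothetical abnormal grades.
LEVELS = {"warning": 0, "critical": 1, "fatal": 2}


@dataclass
class AbnormalLog:
    timestamp: int  # report time in seconds
    event: str      # abnormal event the log maps to
    level: str      # abnormal grade of that event


def combined_id(event: str, level: str) -> int:
    """Combine event identifier and grade into one identifier (assumed scheme)."""
    return EVENT_LIBRARY[event] * len(LEVELS) + LEVELS[level]


def build_identifier_sequence(logs):
    """Return (timestamp, identifier) pairs in reverse reporting order."""
    ordered = sorted(logs, key=lambda log: log.timestamp, reverse=True)
    return [(log.timestamp, combined_id(log.event, log.level)) for log in ordered]


def sample_windows(sequence, first_end, last_end, interval, window_length):
    """Sample one window of identifiers ending at each preset position."""
    windows = []
    for end in range(first_end, last_end + 1, interval):
        windows.append(
            [ident for ts, ident in sequence if end - window_length < ts <= end]
        )
    return windows


logs = [
    AbnormalLog(10, "nic_flap", "warning"),
    AbnormalLog(40, "disk_io_error", "critical"),
    AbnormalLog(55, "memory_ecc_error", "fatal"),
]
sequence = build_identifier_sequence(logs)
windows = sample_windows(sequence, first_end=30, last_end=60,
                         interval=30, window_length=30)
```

Each window keeps the reverse reporting order of the identification sequence, so the most recent abnormal log comes first.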

Optionally, the abnormal identification sequence includes a downtime sample and a non-downtime sample, where the downtime sample is a numerical identification sample corresponding to a system downtime occurring in the cloud computing system, and the non-downtime sample is a numerical identification sample corresponding to another system abnormality occurring in the cloud computing system except for the system downtime;

sampling the abnormal identification sequence according to a preset time interval and a preset sampling window length to obtain an abnormal sequence, wherein the abnormal sequence comprises the following steps:

determining a plurality of sampling positions on the abnormal identification sequence according to the preset time interval;

acquiring a target sampling position corresponding to the downtime sample from the plurality of sampling positions, and taking the target sampling position as a sampling starting position;

and retaining the downtime samples that start from the sampling starting position within the preset sampling window length, and, starting from the sampling starting position, randomly retaining the non-downtime samples on the abnormal identification sequence according to a preset proportion, to obtain an abnormal sequence for the system downtime.
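One possible reading of this sampling step is sketched below. The tuple layout `(time, identifier, is_downtime)`, the choice of the target position as the latest sampling position not later than the first downtime sample, and the fixed random seed are assumptions made for illustration; the text does not fix these details.

```python
import random


def sample_abnormal_sequence(samples, interval, window_length, keep_ratio, seed=0):
    """Sketch of the sampling step.

    samples: list of (time, identifier, is_downtime) tuples.
    Downtime samples inside the window are always kept; non-downtime
    samples after the starting position are kept with probability keep_ratio.
    """
    rng = random.Random(seed)
    times = [t for t, _, _ in samples]
    # Candidate sampling positions spaced by the preset time interval.
    positions = list(range(min(times), max(times) + 1, interval))
    downtime_times = [t for t, _, is_down in samples if is_down]
    if not downtime_times:
        return []
    # ASSUMPTION: the target position is the latest candidate position
    # that is not later than the first downtime sample.
    start = max(p for p in positions if p <= min(downtime_times))
    kept = []
    for t, ident, is_down in samples:
        if t < start:
            continue
        if is_down:
            if t < start + window_length:
                kept.append(ident)
        elif rng.random() < keep_ratio:
            kept.append(ident)
    return kept


samples = [
    (0, 6, False), (10, 1, False), (20, 5, True),
    (25, 7, False), (30, 5, True), (70, 5, True),
]
only_downtime = sample_abnormal_sequence(samples, 10, 20, keep_ratio=0.0)
everything = sample_abnormal_sequence(samples, 10, 20, keep_ratio=1.0)
```

With `keep_ratio=0.0` only the downtime samples inside the window survive, and with `keep_ratio=1.0` every non-downtime sample after the starting position is retained as well, which brackets the behaviour of intermediate proportions.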

Optionally, the obtaining a fault prediction result for the system according to the abnormal sequence and the fault prediction model includes:

inputting the abnormal sequence as an input item to the fault prediction model, and outputting to obtain a prediction confidence coefficient;

if the prediction confidence is greater than a preset threshold, determining that the fault prediction result aiming at the system is a downtime result;

and/or if the prediction confidence is smaller than or equal to the preset threshold, determining that the predicted fault result aiming at the system is a non-downtime result.
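The decision rule above amounts to a strict threshold comparison. A minimal sketch, where the function name and the default threshold of 0.5 are assumptions (the text only speaks of a preset threshold):

```python
def fault_prediction_result(confidence: float, threshold: float = 0.5) -> str:
    """Map a prediction confidence to a downtime or non-downtime result."""
    # Strictly greater than the threshold means downtime; a confidence
    # equal to the threshold is treated as non-downtime.
    return "downtime" if confidence > threshold else "non-downtime"


at_threshold = fault_prediction_result(0.5)
above_threshold = fault_prediction_result(0.51)
```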

Optionally, the fault prediction model has a classifier module for instructing the fault prediction module to determine a fault prediction result of the cloud computing system; the fault prediction model is generated by:

acquiring an offline data set, wherein the offline data set comprises a sample abnormal sequence obtained through data preprocessing, the sample abnormal sequence is used as an input item for training the fault prediction model, and the sample abnormal sequence is generated based on abnormal events obtained by mapping abnormal logs reported by fault downtime aiming at system abnormalities in a cloud computing system and on abnormal grades corresponding to the mapped training abnormal events;

initializing a preset matrix randomly to obtain a model matrix, and encoding the sample abnormal sequence by adopting the model matrix to obtain an encoding vector;

acquiring an initialized weight matrix, and acquiring an output sequence according to the coding vector and the weight matrix;

and acquiring parameters of the classifier module, and generating a fault prediction model according to the output sequence and the parameters of the classifier module.

Optionally, each numerical identification sample corresponding to a system anomaly is included in the sample anomaly sequence; the encoding of the sample abnormal sequence by using the model matrix to obtain an encoding vector comprises:

obtaining each numerical identification sample corresponding to each system abnormality in the abnormality sequence, and mapping and converting each numerical identification sample corresponding to each abnormality in the abnormality sequence into a corresponding feature vector;

determining the gradient of the loss with respect to the feature vector by adopting a preset loss function, and back-propagating that gradient to update the corresponding position of the model matrix;

and coding the sample abnormal sequence by adopting the updated model matrix to obtain a coding vector of each numerical identification sample corresponding to each system abnormality in the sample abnormal sequence.

Optionally, the fault prediction model has a multi-layer linear complexity attention module, and the linear complexity attention module is configured to instruct the fault prediction model to perform system fault detection based on weighted loss; the obtaining an output sequence according to the coding vector and the weight matrix includes:

taking the coding vector as an input item of a first layer of the linear complexity attention module, and multiplying the input item by the weight matrix to obtain an output item of the first layer;

starting from the second layer of the linear complexity attention module, taking the sum of the input item of the previous layer and the output item of the previous layer as the input item of the next layer, and multiplying the input item of the next layer by the weight matrix until the sum of the input item of the previous layer and the output item of the previous layer is multiplied by the weight matrix to obtain the output item of the top layer;

and forming the feature vectors of the output items at the top layer into an output sequence.
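The layer-by-layer computation described above can be sketched with plain matrix operations. This sketch reuses a single weight matrix for every layer, following the wording of the text, and omits the attention computation inside each layer; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stack_layers(encoding, weight, num_layers):
    """Return the top-layer output for the given coding vectors."""
    # First layer: the coding vectors are the input item.
    layer_input = encoding
    layer_output = matmul(layer_input, weight)
    # From the second layer on, the input item is the sum of the previous
    # layer's input item and output item (a residual connection).
    for _ in range(num_layers - 1):
        layer_input = add(layer_input, layer_output)
        layer_output = matmul(layer_input, weight)
    return layer_output


encoding = [[1.0, 0.0], [0.0, 1.0]]
weight = [[0.5, 0.0], [0.0, 0.5]]
top_output = stack_layers(encoding, weight, num_layers=2)
```

The rows of `top_output` are the feature vectors that form the output sequence. Using a separate weight matrix per layer would be a straightforward variation of the same loop.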

Optionally, the generating a fault prediction model according to the output sequence and the parameters of the classifier module includes:

acquiring the feature vector at the first position of the output sequence as a target feature vector, inputting the target feature vector into the classifier module for classification, and determining the weighted cross entropy loss and the confidence for training the fault prediction model;

and generating a target loss function according to the weighted cross entropy loss and the confidence coefficient of the fault prediction model, and training by adopting the target loss function to obtain the fault prediction model.

Optionally, the inputting the obtained target feature vector into the classifier module for classification, and determining a weighted cross entropy loss and a confidence for training the fault prediction model includes:

inputting the obtained target characteristic vectors into the classifier module for classification to obtain a confidence coefficient that the classifier module classifies each sample corresponding to each characteristic vector in the output sequence as a downtime sample;

determining a fault downtime unit to which each sample corresponding to each feature vector in the output sequence belongs;

when samples of the same fault downtime unit appear in subsequent output sequences, calculating a downtime weight for the fault downtime unit based on the confidence recorded for that fault downtime unit, and adjusting the weight of the samples of that fault downtime unit to the downtime weight;

and performing a weighted loss calculation using the samples of the same fault downtime unit and the downtime weight assigned to them to determine the weighted cross entropy loss, and updating the confidence recorded for that fault downtime unit to the maximum confidence with which any trained sample of the fault downtime unit has been classified as a downtime sample.
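A minimal sketch of such a per-unit weighted cross entropy follows. The text does not give the formula for the downtime weight, so the rule used here (`weight = 1 - recorded confidence` for downtime-labelled samples, and a weight of 1 otherwise) is purely an assumption chosen to show the bookkeeping; the function and variable names are hypothetical.

```python
import math


def unit_weighted_cross_entropy(batch, recorded_confidence):
    """Weighted binary cross entropy with one recorded confidence per unit.

    batch: list of (unit_id, label, confidence), where label 1 marks a
    downtime sample and confidence is the classifier's downtime confidence.
    recorded_confidence: dict unit_id -> maximum downtime confidence seen
    so far for that unit; it is updated in place.
    """
    eps = 1e-12
    total = 0.0
    for unit_id, label, confidence in batch:
        previous = recorded_confidence.get(unit_id)
        # ASSUMPTION: once a unit has already been recognised with high
        # confidence, its remaining downtime-labelled samples weigh less.
        if label == 1 and previous is not None:
            weight = 1.0 - previous
        else:
            weight = 1.0
        probability = confidence if label == 1 else 1.0 - confidence
        total += -weight * math.log(max(probability, eps))
        if label == 1:
            # Keep the maximum downtime confidence over the unit's samples.
            recorded_confidence[unit_id] = max(previous or 0.0, confidence)
    return total / len(batch)


recorded = {}
first_loss = unit_weighted_cross_entropy([("nc-1", 1, 0.9)], recorded)
second_loss = unit_weighted_cross_entropy([("nc-1", 1, 0.2)], recorded)
```

In this sketch the second sample of `nc-1` is predicted poorly, but because the unit was already classified as downtime with confidence 0.9, its loss is scaled by roughly 0.1, which is one way a likely confusing sample could be down-weighted.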

Optionally, the generating a target loss function according to the weighted cross entropy loss and the confidence of the fault prediction model, and obtaining the fault prediction model by training the target loss function includes:

and back-propagating the gradient of the weighted cross entropy loss through the target loss function, updating the parameters of the fault prediction model, and generating the fault prediction model.

The embodiment of the application also discloses a system failure prediction device based on weighted loss, which is applied to a cloud computing system, and the device comprises:

the abnormal sequence generating module is used for acquiring real-time system data acquired in real time and generating an abnormal sequence according to the real-time system data acquired in real time;

the fault prediction model acquisition module is used for acquiring a fault prediction model; the fault prediction model is generated based on target loss function training, and the target loss function is obtained based on weighted cross entropy loss and confidence coefficient calculation of the fault prediction model;

and the fault prediction module is used for obtaining a fault prediction result aiming at the cloud computing system according to the abnormal sequence and the fault prediction model.

Optionally, the real-time system data collected in real time includes an abnormal log reported by each fault downtime unit in the cloud computing system for system abnormality; the abnormal sequence generation module 501 may include the following sub-modules:

the abnormal event library construction submodule is used for acquiring the reporting sequence of the abnormal logs reported by each fault downtime unit and mapping the abnormal logs reported by each fault downtime unit aiming at the system abnormity to obtain abnormal events, wherein each abnormal event has a corresponding numerical identifier in the abnormal event library, and each abnormal event has a corresponding abnormal grade;

the abnormal numerical identifier mapping submodule is used for carrying out combined mapping on the numerical identifier corresponding to the mapped abnormal event and the abnormal grade corresponding to the abnormal event to obtain an abnormal numerical identifier aiming at each abnormal log;

the sequence composition submodule is used for performing composition of the abnormal identification sequence aiming at the abnormal numerical value identification of each abnormal log according to the reverse order of the report order;

and the abnormal sequence generation submodule is used for sampling the abnormal identification sequence according to a preset time interval and a preset sampling window length to obtain an abnormal sequence.

In an embodiment of the present application, the abnormal identification sequences include downtime samples and non-downtime samples, where the downtime samples are numerical identification samples corresponding to system downtime occurring in the cloud computing system, and the non-downtime samples are numerical identification samples corresponding to other system abnormalities occurring in the cloud computing system except for the system downtime; the abnormal sequence generation submodule may include the following units:

the sampling point determining unit is used for determining a plurality of sampling positions on the abnormal identification sequence according to the preset time interval;

a sampling initial position determining unit, configured to obtain a target sampling position corresponding to the downtime sample from the multiple sampling positions, and use the target sampling position as a sampling initial position;

and the abnormal sequence generating unit is used for reserving the downtime samples starting from the sampling initial positions within the length of a preset sampling window, and randomly reserving the non-downtime samples on the abnormal identification sequence according to a preset proportion starting from the sampling initial positions to obtain an abnormal sequence aiming at system downtime.

Optionally, the failure prediction module comprises:

the confidence coefficient output submodule is used for inputting the abnormal sequence as an input item to the fault prediction model and outputting to obtain a prediction confidence coefficient;

the first fault prediction submodule is used for determining the fault prediction result aiming at the system as a downtime result when the prediction confidence coefficient is greater than a preset threshold value;

and the second fault prediction sub-module is used for determining that the predicted fault result aiming at the system is a non-downtime result when the prediction confidence coefficient is smaller than or equal to the preset threshold value.

Optionally, the fault prediction model has a classifier module for instructing the fault prediction module to determine a fault prediction result of the cloud computing system; the device further comprises:

and the fault prediction model generation module is used for generating a fault prediction model based on the weighted loss.

Optionally, the fault prediction model generation module includes:

the off-line data set acquisition submodule is used for acquiring an off-line data set, the off-line data set comprises a sample abnormal sequence obtained through data preprocessing, the sample abnormal sequence is used as an input item for training the fault prediction model, and the sample abnormal sequence is generated on the basis of abnormal events obtained by mapping abnormal logs reported by fault downtime aiming at system abnormality in the cloud computing system and on the basis of abnormal grades corresponding to the mapped training abnormal events;

the encoding submodule is used for initializing a preset matrix at random to obtain a model matrix and encoding the sample abnormal sequence by adopting the model matrix to obtain an encoding vector;

the output sequence generation submodule is used for acquiring an initialized weight matrix and obtaining an output sequence according to the coding vector and the weight matrix;

and the fault prediction model generation submodule is used for acquiring the parameters of the classifier module and generating a fault prediction model according to the output sequence and the parameters of the classifier module.

Optionally, each numerical identification sample corresponding to a system anomaly is included in the sample anomaly sequence; the encoding submodule includes:

a feature vector conversion unit, configured to obtain each numerical identifier sample corresponding to each system anomaly in the anomaly sequence, and map and convert each numerical identifier sample corresponding to each system anomaly in the anomaly sequence into a corresponding feature vector;

the matrix updating unit is used for determining the loss gradient of the characteristic vector by adopting a preset loss function and reversely transmitting the loss gradient of the characteristic vector to update the corresponding position of the model matrix;

and the coding vector output unit is used for coding the sample abnormal sequence by adopting the updated model matrix to obtain a coding vector of each numerical identification sample corresponding to each system abnormality in the sample abnormal sequence.

Optionally, the fault prediction model has a multi-layer linear complexity attention module, and the linear complexity attention module is configured to instruct the fault prediction model to perform system fault detection based on weighted loss; the output sequence generation submodule comprises:

the output item generating unit is used for taking the coding vector as an input item of a first layer of the linear complexity attention module and multiplying the input item by the weight matrix to obtain an output item of the first layer;

the iteration calculation unit is used for starting from the second layer of the linear complexity attention module, using the sum of the input item of the previous layer and the output item of the previous layer as the input item of the next layer, and multiplying the input item of the next layer by the weight matrix until the sum of the input item of the previous layer and the output item of the previous layer is multiplied by the weight matrix to obtain the output item of the top layer;

and the output sequence composition unit is used for composing the feature vectors of the output items at the top layer into an output sequence.

Optionally, the fault prediction model generation sub-module includes:

the weighted cross entropy loss determining unit is used for acquiring the feature vector positioned at the first position in the output sequence, inputting the acquired target feature vector into the classifier module for classification, and determining weighted cross entropy loss and confidence coefficient for training the fault prediction model;

and the fault prediction model generation unit is used for generating a target loss function according to the weighted cross entropy loss and the confidence coefficient of the fault prediction model, and obtaining the fault prediction model by adopting the target loss function for training.

Optionally, the weighted cross entropy loss determining unit includes:

the confidence determining subunit is configured to input the obtained target feature vectors to the classifier module for classification, so as to obtain a confidence that the classifier module classifies each sample corresponding to each feature vector in the output sequence as a downtime sample;

the failure downtime unit determining subunit is used for determining the failure downtime unit to which each sample corresponding to each feature vector in the output sequence belongs;

the weight setting subunit is used for calculating the downtime weight aiming at the downtime of the system of the fault downtime unit based on the confidence coefficient recorded by the same fault downtime unit when the same sample of the fault downtime unit exists in the subsequent output sequence, and increasing the weight of the sample of the same fault downtime unit to the downtime weight;

and the confidence coefficient updating subunit is used for performing weighted loss calculation by adopting the downtime weights added by the samples of the same fault downtime unit and the samples of the same fault downtime unit to determine weighted cross entropy loss, and updating the confidence coefficient recorded by the same fault downtime unit to be the maximum confidence coefficient output by classifying the trained samples as the downtime samples under the whole fault downtime unit.

Optionally, the failure prediction model generation unit includes:

and the fault prediction model generation subunit is used for reversely transmitting the gradient of the weighted cross entropy loss through the target loss function, updating the parameters of the fault prediction model and generating the fault prediction model.

The embodiment of the application also discloses an electronic device, which comprises: a processor, a memory, and a computer program stored on the memory and capable of running on the processor, the computer program when executed by the processor implementing any of the weighted loss based system failure prediction methods.

The embodiment of the application also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the computer program realizes any one of the weighted loss based system failure prediction methods.

The embodiment of the application has the following advantages:

In the embodiments of the application, when fault prediction is performed on system data collected in real time, an abnormal sequence is generated from the real-time system data, and the acquired fault prediction model is then applied to the abnormal sequence to obtain a fault prediction result. The fault prediction model is trained with a target loss function, and the target loss function is computed from a weighted cross entropy loss and the confidence of the fault prediction model. Because the weighted cross entropy loss is generated from the predictions for multiple samples collected over the life cycle of a fault downtime unit, the weight of confusing samples can be reduced, the gap between the training objects and the test objects can be bridged, the negative influence of confusing samples on the model is reduced, and the accuracy of model prediction is improved.

Drawings

FIG. 1 is a flow chart of steps of an embodiment of a weighted loss based system fault prediction method of the present application;

FIG. 2 is a flow chart of steps in another embodiment of a method for weighted loss based system fault prediction according to the present application;

FIG. 3 is a schematic diagram of a training framework of a fault prediction model provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of an application scenario of weighted loss based system failure prediction according to an embodiment of the present application;

FIG. 5 is a block diagram of an embodiment of a weighted loss based system failure prediction apparatus according to the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

To facilitate understanding of the application by those skilled in the art, the terms used in the following embodiments of the application are explained below:

Attention: the attention mechanism devotes more attention resources to the target regions that matter most, so as to obtain more detailed information about the targets of interest and suppress other useless information. In the embodiments of the application, the attention mechanism can serve as an interface that provides information for mining the correlations among the elements of a sequence.

Transformer: an attention-based machine learning model, commonly used for natural language processing and image processing tasks.

BERT: Bidirectional Encoder Representations from Transformers, a bidirectionally encoded Transformer model, commonly used for natural language processing and image processing tasks.

XgBoost: Extreme Gradient Boosting, a commonly used ensemble learning algorithm.

NCLAT: NC loss and Linear Attention based Transformer, the Transformer algorithm proposed in the embodiments of the application, based on an NC-level weighted loss and an attention mechanism with linear computational complexity.

NC: Node Controller, the individual unit by which fault downtime is counted in the cloud computing system.

GELU: Gaussian Error Linear Unit, an activation function used in the BERT model that introduces the idea of stochastic regularization.

ECS: Elastic Compute Service, a cloud server.

LogBERT model: in the embodiments of the application, a machine learning fault prediction model, based on statistical features, built on the BERT model.

XgBoost model: in the embodiments of the application, a machine learning fault prediction model based on statistical features, implemented with a common ensemble learning algorithm.

FFN: Feed-Forward Network.

NC downtime is one of the important factors affecting the stability of a cloud computing system, and unexpected sudden downtime causes serious losses to users. The abnormal log records of NCs that have already gone down can be used to learn the abnormal output patterns of such NCs, so that whether a current NC will crash in the coming period can be inferred from the abnormal log records on the monitoring side. In other words, by having a model learn the system log reporting patterns of known faulty NCs, server downtime can be predicted in advance, and the prediction result can be used to decide whether to perform operation and maintenance ahead of the downtime fault, so that users do not perceive the downtime and the user experience is improved.

However, existing downtime prediction algorithms incur a very high computational cost when the input sequence is long and cannot meet the requirement of online real-time prediction. In addition, because a cloud computing system contains a large number of servers that generate a large volume of anomalies, and the key abnormal patterns of different downtime types occur at different times, labels applied to the data under a uniform standard may be wrong, so that a large number of confusing samples appear in the training set.

In related downtime prediction algorithms, a LogBERT model and/or an XgBoost model learns the abnormal output patterns of crashed NCs from their existing abnormal log records, and whether the current NC will crash in the coming period is inferred from the abnormal log records on the monitoring side.

Specifically, the LogBERT model processes log text with natural language processing techniques. It parses the log text according to a specified format and uses the BERT model as a feature extractor. A large amount of log text is first used to construct an auxiliary learning task on which the BERT model is pre-trained to learn the compositional logic of log text, and the pre-trained model is then used for learning and testing on the fault prediction task. However, this pre-training process increases the training time, and the complexity of encoding log text increases the complexity of the model, so that the model is too large and inefficient. Moreover, because confusing samples are not considered during training, a model trained in a scenario with confusing samples has low prediction accuracy.

The XgBoost model mainly handles the fault prediction problem with statistical features. Specifically, the number of occurrences of each abnormal log within a fixed time window is counted for each NC as the statistical feature of that NC, the features are fed into an XgBoost classifier for training, and the trained model is tested online. However, the statistical features extracted for the XgBoost model contain only the frequency of the anomalies and not information such as their order of occurrence. They do not consider the relationships between anomalies, so complex combinations of anomalies cannot be learned and the abnormal patterns learned are limited. In addition, like the BERT model, it uses only cross entropy or squared loss as the loss function, so the trained model cannot distinguish confusing samples, and the large number of confusing samples in the data set negatively affects the model, so that its fault prediction accuracy is not high.
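The loss of ordering information in frequency features can be seen in a small example. The identifiers below are hypothetical, and `frequency_features` is only a stand-in for the per-window counting described above, not the feature pipeline of any particular system.

```python
from collections import Counter


def frequency_features(identifiers):
    """Count how often each abnormal identifier occurs in one time window."""
    return Counter(identifiers)


window_a = [3, 3, 7, 1]  # hypothetical identifiers in reporting order
window_b = [1, 7, 3, 3]  # the same anomalies reported in a different order
same_features = frequency_features(window_a) == frequency_features(window_b)
```

The two windows differ as sequences but produce identical count features, so a classifier trained on counts alone cannot tell them apart.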

As can be seen from the above, both the LogBERT model and the XgBoost model perform the fault prediction task by using the abnormal logs of the system, but neither of them takes confusing samples into account during fault learning.

One of the core ideas of the embodiments of the present application is to adopt a linear Attention calculation to reduce the computational complexity of the model, and to adopt an NC-level loss function that bridges the difference between the training objects and the test objects, thereby reducing the negative influence of confusing samples on the model and improving the accuracy of the system fault prediction result. Specifically, an Attention with linear complexity replaces the ordinary Attention mechanism in BERT, which reduces the computational complexity of the model, improves its efficiency and stability, and allows more data to be processed simultaneously in online prediction. An NC-level loss function is adopted because multiple samples collected within the life cycle of an NC are used in training, whereas the final prediction object is the NC itself; the NC-level loss is designed to bridge this difference between the training objects and the test objects and to reduce the negative influence of confusing samples on the model. Furthermore, the log text is abstracted and a corresponding abnormal event library is constructed to encode the logs; because the encoding is based on an independent abnormal event library, the complex process of pre-training a BERT model to encode log text is avoided, the computational complexity of the model is reduced, the trained model is smaller, and training and testing are faster and more efficient.

Referring to fig. 1, a flowchart of the steps of an embodiment of the weighted loss based system fault prediction method of the present application is shown. The method is applied to a cloud computing system and focuses on the use of the fault prediction model, and may specifically include the following steps:

Step 101, acquiring a fault prediction model and real-time system data collected in real time;

In the embodiments of the present application, a linear Attention calculation may be adopted to reduce the computational complexity of the model, and an NC-level loss function is also adopted; by designing the NC-level loss to bridge the difference between the training objects and the test objects, the negative influence of confusing samples on the model is reduced and the accuracy of the system fault prediction result is improved.

In order for the fault prediction model to bridge the difference between the training objects and the test objects through the NC-level loss, and thereby reduce the negative influence of confusing samples on the model, the NC-level loss may specifically be implemented as a loss function calculated on the basis of a weighted cross entropy loss. The weighted cross entropy loss is generated from the model's predictions for the multiple samples collected within the life cycle of a fault downtime unit and can reduce the weight of confusing samples. The acquired fault prediction model may be a model generated by training with a target loss function, where the target loss function is calculated from the weighted cross entropy loss and the confidence of the fault prediction model. It should be noted that, compared with the cross entropy loss commonly used in training, the weighted cross entropy loss reduces the weight of confusing samples and thereby improves the accuracy of the model.
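
As a rough illustration of how a weighted cross entropy can lower the contribution of confusing samples, the following Python sketch computes a per-sample weighted binary cross entropy. The function name `weighted_cross_entropy`, the normalization by the total weight, and the example weight values are assumptions made for illustration; this passage does not specify how the weights are derived from the model confidence.

```python
import math

def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Weighted binary cross entropy, normalized by the total weight.

    probs   : predicted downtime probabilities, one per sample
    labels  : 1 for a downtime (positive) sample, 0 otherwise
    weights : per-sample weights (hypothetical values; a smaller weight
              reduces the influence of a sample on the loss)
    """
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * ce
    return total / sum(weights)

probs = [0.9, 0.2, 0.6]
labels = [1, 1, 0]
uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
# The second sample is treated as a confusing sample and down-weighted.
down_weighted = weighted_cross_entropy(probs, labels, [1.0, 0.1, 1.0])
```

In this example the poorly predicted second sample dominates the uniform loss, and lowering its weight reduces the overall loss value.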

In the specific use of the fault prediction model, after the fault prediction model trained with the target loss function (calculated from the weighted cross entropy loss and the confidence of the fault prediction model) is acquired, an abnormal event sequence may be obtained at a given moment, the abnormal events and their corresponding abnormal levels are mapped to obtain an integer sequence, and the integer sequence is fed into the trained fault prediction model for prediction.

Step 102, generating an abnormal sequence according to the real-time system data collected in real time;

In practical applications, the abnormal sequence obtained at a given moment may be generated from the real-time system data collected in real time; the abnormal sequence is the integer sequence that is fed into the trained fault prediction model for prediction.

Specifically, the real-time system data collected in real time may include the abnormal logs reported for system abnormalities by each fault downtime unit in the cloud computing system. The log text contained in these abnormal logs may be abstracted, and the logs may be encoded based on a pre-constructed abnormal event library to obtain the abnormal events corresponding to the logs. That is, corresponding abnormal events are constructed through the abstraction processing, each constructed abnormal event is assigned a numerical identifier, and the abnormal event library covering all abnormal events is thereby constructed in advance. The subsequent data preprocessing of the abnormal logs can then be based on this independent abnormal event library: encoding the abnormal logs through abnormal events and numerical identifiers reduces the size of the model, accelerates training and testing, and avoids the complex process of pre-training a BERT model to encode log text. The data preprocessing is performed on the abnormal logs reported by each fault downtime unit, that is, by each NC, the unit for which downtime is counted.

In an embodiment of the present invention, the abnormal logs reported by each fault downtime unit for system abnormalities may be mapped to abnormal events, and the numerical identifier corresponding to each mapped abnormal event may then be acquired from the pre-constructed abnormal event library, where the numerical identifier is generally an integer corresponding to the abnormal event. After the reporting order of the abnormal logs reported by each fault downtime unit is acquired, the numerical identifiers (that is, the corresponding integers) of the mapped abnormal events are sorted to form an abnormal identification sequence, and the abnormal identification sequence is sampled according to a preset time interval and a preset sampling window length to obtain the abnormal sequence.

In practical applications, the reported abnormal logs may be mapped to abnormal events through regular expressions. Specifically, the abnormal logs reported by each NC are acquired in real time through a detector, and the various abnormal logs are abstracted, in one-to-one correspondence, into phrases according to expert knowledge and regular expressions; each abstracted phrase may be called an abnormal event. For each log, the keywords it shares with other logs, such as Error and Hardware, may be determined, and the logs are then assigned different abnormality names according to the types of the keywords, yielding a phrase that names the abnormal event. For example, the original log text "mce: [Hardware Error]: Machine check events logged" is abstracted into the abnormal event "dmesg_unrecover_mce". Meanwhile, the mapped abnormal events may be assigned corresponding abnormal levels, for example 5 levels: normal, low, warning, critical, and fatal; the final abnormal event for the example log is then "dmesg_unrecover_mce-critical". Because an abnormal log reported online is a statement describing the state of the computer and contains much redundant information, abstracting the log text extracts the key information in the abnormal log. The abnormal logs reported online may also be selectively screened, for example by selecting only logs containing an Error field for analysis.
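
The regular-expression mapping described above can be sketched as follows. This is a minimal illustration, not the rule set of the embodiment: the table `RULES`, the function `map_log_to_event`, and the single pattern are hypothetical, and the event name is written here as `dmesg_unrecover_mce`.

```python
import re

# Hypothetical rule table: (compiled pattern, abnormal event name, abnormal level).
# In practice such rules would be written from expert knowledge.
RULES = [
    (re.compile(r"mce: \[Hardware Error\]: Machine check events logged"),
     "dmesg_unrecover_mce", "critical"),
]

def map_log_to_event(log_text):
    """Return 'event-level' for the first matching rule, or None if no rule matches."""
    for pattern, event, level in RULES:
        if pattern.search(log_text):
            return f"{event}-{level}"
    return None

event = map_log_to_event("kernel: mce: [Hardware Error]: Machine check events logged")
```

Logs that match no rule return `None`, which corresponds to screening out logs that carry no relevant keyword.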

It should be noted that an abstracted abnormal event is usually a unified description of a class of abnormal logs. The rules for determining the abnormal level usually vary with the type of abnormal event, and the specific level is generally determined by the severity of the event; the level further subdivides the abnormal events and adds detailed information, which is beneficial to downtime prediction.

After the abnormal events are obtained through mapping, they may be converted into numerical identifiers based on the pre-constructed abnormal event library. The abnormal event library is similar to a corpus in natural language processing and contains the various abnormal event types; each abnormal event corresponds to one numerical identifier, namely an integer, and each abnormal event has a corresponding abnormal level. Because the abnormal events in the library are presented as phrases, which cannot be used directly to train the fault prediction model (the model generally takes vectors as input), each abnormal event may be converted into a number according to the abnormal event library for training. For example, the abnormal event "dmesg_unrecover_mce" is converted into 32; likewise, the abnormal levels are mapped to the integers 1 to 5, so that "dmesg_unrecover_mce-critical" is converted into "32-4". It should be noted that, for better distinction, the same abnormality at different levels may be regarded as different independent abnormalities. Specifically, given that there are 5 abnormal levels, if the index of an abnormal event in the dictionary is i and its level is j, the abnormal event "i-j" may be mapped to the integer i × 5 + j. For example, under this mapping of "abnormal event-level" pairs, "32-4" is mapped to the integer 32 × 5 + 4 = 164.
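
The mapping of an "event index-level" pair to a single integer can be checked with a short Python sketch. The function names `encode_event` and `decode_event` are hypothetical; the inverse mapping is not described in the text and is included only to show that the encoding is unambiguous when the level j ranges from 1 to 5.

```python
NUM_LEVELS = 5  # number of abnormal levels, mapped to the integers 1..5

def encode_event(event_index, level):
    """Map the pair (i, j) to the single integer i * 5 + j."""
    if not 1 <= level <= NUM_LEVELS:
        raise ValueError("level must be in 1..5")
    return event_index * NUM_LEVELS + level

def decode_event(code):
    """Hypothetical inverse of encode_event, valid for levels in 1..5."""
    return (code - 1) // NUM_LEVELS, (code - 1) % NUM_LEVELS + 1

code = encode_event(32, 4)  # the "32-4" example from the text
```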

After the integers corresponding to the abnormal events and their abnormal levels have been mapped, the order of the sequence is chosen so that the fault prediction model sees the system abnormalities that occurred in the most recent period: because the length of the input sequence allowed by the fault prediction model is usually limited, the more recent abnormalities must be kept preferentially when a long sequence is truncated. Specifically, the integer corresponding to an abnormal event and the abnormal level corresponding to that event are combined to obtain the abnormal numerical identifier for each abnormal log, and the abnormal identification sequence is formed in the reverse of the reporting order of the abnormal logs. The abnormal identification sequence is then sampled according to a preset time interval and a preset sampling window length to obtain the abnormal sequence, and an abnormal timestamp sequence is obtained according to the abnormal sequence. In other words, the numbers converted from the abnormalities reported by each NC are arranged in reverse order to form the abnormal identification sequence.

The system abnormalities occurring in the cloud computing system may include downtime (commonly also called a crash) and may also include other faults. The abnormal identification sequence generated from the reported abnormal logs may therefore include downtime samples and non-downtime samples, where a downtime sample refers to a numerical identification sample corresponding to system downtime occurring in the cloud computing system, and a non-downtime sample refers to a numerical identification sample corresponding to a system abnormality other than system downtime.

At this time, a plurality of sampling positions may be determined on the abnormal identification sequence according to the preset time interval, and the target sampling position corresponding to the downtime sample, that is, the sampling position at the downtime moment, is obtained from the plurality of sampling positions and used as the sampling start position. The downtime samples within the preset sampling window length starting from the sampling start position are retained, and, starting from the sampling start position, the non-downtime samples on the abnormal identification sequence are randomly retained according to a preset proportion, so as to obtain the abnormal sequence for system downtime.

For example, after the abnormal identification sequence is formed in reverse order, the whole sequence may be sampled with overlap according to a preset time interval, for example 5 minutes, and a preset sampling window length, for example 3 days. With one sampling position every 5 minutes, each sampling position contains the system abnormalities within the 3 days before its sampling time. Because the time interval between two adjacent sampling positions is much smaller than the preset sampling window length, adjacent windows overlap: if the sequence acquired at sampling position A covers the 3 days (i.e., 72 hours) before its sampling time, for example 12:00, then the adjacent sampling position B, sampled 5 minutes later, acquires the sequence in that 5-minute period, which sampling position A cannot acquire, while the sequences of sampling position B and sampling position A overlap in the remaining 2 days, 23 hours and 55 minutes. This preserves the correlation between newly occurring abnormalities and the abnormalities that occurred before them. For a downtime sample (positive sample), the data within the 3-day period starting from the sampling start position at the downtime moment may be retained; for normal samples (negative samples), a certain proportion of samples is randomly retained over the whole sequence. An abnormal sequence of the form "[start], 32, 91, ..., 256" is thereby obtained. The random retention of negative samples preserves the diversity of the training samples, and the down-sampling reduces the class imbalance ratio. Here [start] is a start token and has no specific meaning.
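
The overlapping window sampling can be illustrated with a small Python sketch. It is a simplified example under stated assumptions: timestamps are integers in minutes, the helper names `window_sequence` and `sample_positions` and the parameter `max_len` are hypothetical, and the random down-sampling of negative samples is omitted.

```python
START = "[start]"  # start token; carries no specific meaning

def sample_positions(first, last, interval):
    """Sampling times from first to last (inclusive), one every `interval` minutes."""
    return list(range(first, last + 1, interval))

def window_sequence(events, sample_time, window, max_len=None):
    """Build the abnormal sequence seen at one sampling position.

    events      : list of (timestamp, abnormal_id) pairs
    sample_time : sampling time of this position
    window      : sampling window length (same unit as the timestamps)
    max_len     : optional cap on the sequence length; because the sequence
                  is newest-first, truncation keeps the most recent abnormalities
    """
    in_window = [(t, e) for t, e in events if sample_time - window < t <= sample_time]
    in_window.sort(key=lambda te: te[0], reverse=True)  # reverse of reporting order
    ids = [e for _, e in in_window]
    if max_len is not None:
        ids = ids[:max_len]
    return [START] + ids

WINDOW = 3 * 24 * 60  # 3 days, in minutes
events = [(10, 32), (4000, 91), (4323, 256)]
seq_a = window_sequence(events, 4320, WINDOW)  # sampling position A
seq_b = window_sequence(events, 4325, WINDOW)  # position B, 5 minutes later
```

Here position B sees the abnormality reported at minute 4323, which position A cannot see, while both positions share the older abnormalities.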

It should be noted that negative samples refer to samples collected within the life cycle of an NC that did not go down, all of which are negative, whereas positive samples refer to samples collected from an NC that went down, within the preset sampling window length (for example, 3 days) starting from the downtime moment. The abnormal events are merely obtained by abstracting the system logs, and the occurrence of an abnormal event does not in itself mean downtime.

Step 103, obtaining a fault prediction result for the cloud computing system according to the abnormal sequence and the fault prediction model.

After the generated abnormal sequence is fed into the trained fault prediction model as its input, the trained fault prediction model performs fault prediction on the input real-time system data and predicts whether the system will fail. Based on the prediction result, it can be decided whether to perform operation and maintenance on the downtime fault in advance, so that the downtime is imperceptible to the user and the user experience is improved.

In practical applications, NC downtime is one of the important factors affecting the stability of the cloud computing system, and unexpected sudden downtime causes serious losses to users. In order to maintain the stability of the cloud computing system, the fault prediction model may infer, based on the input real-time system data, whether the current NC will go down within an upcoming period of time, and output a fault prediction result.

In a specific implementation, the fault prediction result may be determined based on the output prediction confidence: if the prediction confidence is greater than a preset threshold, the fault prediction result for the system is determined to be a downtime result, and/or if the prediction confidence is less than or equal to the preset threshold, the fault prediction result for the system is determined to be a non-downtime result. A downtime result means that the NC is judged to be at risk of downtime within an upcoming period of time, in which case operation and maintenance may be performed on the downtime fault in advance so that the downtime is imperceptible to the user; a non-downtime result means that the NC is not at risk of downtime within the upcoming period of time, in which case no operation is performed for the time being.
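
The threshold rule can be written as a one-line decision, sketched below in Python. The function name `decide_fault_result` and the default threshold of 0.5 are assumptions for illustration; the text only requires a preset threshold.

```python
def decide_fault_result(confidence, threshold=0.5):
    """Return 'downtime' if the confidence exceeds the threshold, else 'non-downtime'.

    The default threshold is a hypothetical value. A confidence exactly equal
    to the threshold yields a non-downtime result, as described in the text.
    """
    return "downtime" if confidence > threshold else "non-downtime"

result = decide_fault_result(0.8)
```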

Referring to fig. 2, a flowchart of the steps of another embodiment of the weighted loss based system fault prediction method of the present application is shown. The method is applied to a cloud computing system and focuses on the generation/training of the fault prediction model, and may specifically include the following steps:

Step 201, obtaining an offline data set, wherein the offline data set comprises a sample abnormal sequence obtained through data preprocessing;

In the embodiments of the present application, the system fault prediction method adopts a linear Attention calculation to reduce the computational complexity of the model, together with an NC-level loss function. In practical applications, a Transformer algorithm based on the NC weighted loss and an attention mechanism of linear computational complexity may be executed by the fault prediction model; the fault prediction model is therefore generated/trained so that it is capable of executing this Transformer algorithm.

Specifically, referring to fig. 3, a schematic diagram of a training framework of the fault prediction model provided in the embodiments of the present application is shown. The weight of confusing samples can be reduced based on the weighted cross entropy loss, which improves the accuracy of the model. Meanwhile, the log text can be abstracted and a corresponding abnormal event library constructed to encode the logs; because the encoding is based on an independent abnormal event library, the complex process of pre-training a BERT model to encode log text is avoided and the computational complexity of the model is reduced.

In this embodiment of the application, the offline data set used for training the fault prediction model may include a sample abnormal sequence obtained through data preprocessing, where the sample abnormal sequence may be generated based on the abnormal events obtained by mapping the reported abnormal logs and the abnormal levels corresponding to those abnormal events.

In an embodiment of the present invention, the sample abnormal sequence may be obtained mainly by sampling according to a preset time interval and a preset sampling window length. Specifically, the abnormal logs may be mapped to abnormal events, and the numerical identifiers (that is, the corresponding integers) of the abnormal events in the pre-constructed abnormal event library are then sorted in the reverse of the reporting order of the abnormal logs to form an abnormal identification sequence, which is sampled according to the preset time interval and the preset sampling window length to obtain the abnormal sequence.

In practical applications, the reported abnormal logs may be mapped to abnormal events through regular expressions. Specifically, the abnormal logs reported by each NC are acquired in real time through a detector, and the various abnormal logs are abstracted, in one-to-one correspondence, into phrases according to expert knowledge and regular expressions; each abstracted phrase may be called an abnormal event. For each log, the keywords it shares with other logs, such as Error and Hardware, may be determined, and the logs are then assigned different abnormality names according to the types of the keywords, yielding a phrase that names the abnormal event. For example, the original log text "mce: [Hardware Error]: Machine check events logged" is abstracted into the abnormal event "dmesg_unrecover_mce". Meanwhile, the mapped abnormal events may be assigned corresponding abnormal levels, for example 5 levels: normal, low, warning, critical, and fatal; the final abnormal event for the example log is then "dmesg_unrecover_mce-critical". Because an abnormal log reported online is a statement describing the state of the computer and contains much redundant information, abstracting the log text extracts the key information in the abnormal log. The abnormal logs reported online may also be selectively screened, for example by selecting only logs containing an Error field for analysis.

It should be noted that an abstracted abnormal event is usually a unified description of a class of abnormal logs. The rules for determining the abnormal level usually vary with the type of abnormal event, and the specific level is generally determined by the severity of the event; the level further subdivides the abnormal events and adds detailed information, which is beneficial to downtime prediction.

After the abnormal events are obtained through mapping, they may be converted into numerical identifiers based on the pre-constructed abnormal event library. The abnormal event library is similar to a corpus in natural language processing and contains the various abnormal event types; each abnormal event corresponds to one numerical identifier, namely an integer, and each abnormal event has a corresponding abnormal level. Because the abnormal events in the library are presented as phrases, which cannot be used directly to train the fault prediction model (the model generally takes vectors as input), each abnormal event may be converted into a number according to the abnormal event library for training. For example, the abnormal event "dmesg_unrecover_mce" is converted into 32; likewise, the abnormal levels are mapped to the integers 1 to 5, so that "dmesg_unrecover_mce-critical" is converted into "32-4". It should be noted that, for better distinction, the same abnormality at different levels may be regarded as different independent abnormalities. Specifically, given that there are 5 abnormal levels, if the index of an abnormal event in the dictionary is i and its level is j, the abnormal event "i-j" may be mapped to the integer i × 5 + j. For example, under this mapping of "abnormal event-level" pairs, "32-4" is mapped to the integer 32 × 5 + 4 = 164.

After the integers corresponding to the abnormal events and their abnormal levels have been mapped, the order of the sequence is chosen so that the fault prediction model sees the system abnormalities that occurred in the most recent period: because the length of the input sequence allowed by the fault prediction model is usually limited, the more recent abnormalities must be kept preferentially when a long sequence is truncated. Specifically, the integer corresponding to an abnormal event and the abnormal level corresponding to that event are combined to obtain the abnormal numerical identifier for each abnormal log, and the abnormal identification sequence is formed in the reverse of the reporting order of the abnormal logs. The abnormal identification sequence is then sampled according to a preset time interval and a preset sampling window length to obtain the abnormal sequence, and an abnormal timestamp sequence is obtained according to the abnormal sequence. In other words, the numbers converted from the abnormalities reported by each NC are arranged in reverse order to form the abnormal identification sequence.

The system abnormalities occurring in the cloud computing system may include downtime (commonly also called a crash) and may also include other faults. The abnormal identification sequence generated from the reported abnormal logs may therefore include downtime samples and non-downtime samples, where a downtime sample refers to a numerical identification sample corresponding to system downtime occurring in the cloud computing system, and a non-downtime sample refers to a numerical identification sample corresponding to a system abnormality other than system downtime.

At this time, a plurality of sampling positions may be determined on the abnormal identification sequence according to the preset time interval, and the target sampling position corresponding to the downtime sample, that is, the sampling position at the downtime moment, is obtained from the plurality of sampling positions and used as the sampling start position. The downtime samples within the preset sampling window length starting from the sampling start position are retained, and, starting from the sampling start position, the non-downtime samples on the abnormal identification sequence are randomly retained according to a preset proportion, so as to obtain the abnormal sequence for system downtime.

For example, after the abnormal identification sequence is formed in reverse order, the whole sequence may be sampled with overlap according to a preset time interval, for example 5 minutes, and a preset sampling window length, for example 3 days. With one sampling position every 5 minutes, each sampling position contains the system abnormalities within the 3 days before its sampling time. Because the time interval between two adjacent sampling positions is much smaller than the preset sampling window length, adjacent windows overlap: if the sequence acquired at sampling position A covers the 3 days (i.e., 72 hours) before its sampling time, for example 12:00, then the adjacent sampling position B, sampled 5 minutes later, acquires the sequence in that 5-minute period, which sampling position A cannot acquire, while the sequences of sampling position B and sampling position A overlap in the remaining 2 days, 23 hours and 55 minutes. This preserves the correlation between newly occurring abnormalities and the abnormalities that occurred before them. For a downtime sample (positive sample), the data within the 3-day period starting from the sampling start position at the downtime moment may be retained; for normal samples (negative samples), a certain proportion of samples is randomly retained over the whole sequence. An abnormal sequence of the form "[start], 32, 91, ..., 256" is thereby obtained. The random retention of negative samples preserves the diversity of the training samples, and the down-sampling reduces the class imbalance ratio. Here [start] is a start token and has no specific meaning.

Step 202, randomly initializing a preset matrix to obtain a model matrix, and encoding the sample abnormal sequence by using the model matrix to obtain coding vectors;

In an embodiment of the present invention, after the sample abnormal sequence included in the offline data set is obtained, each numerical identification sample corresponding to a system abnormality in the sample abnormal sequence may be obtained and mapped into a feature vector. The gradient of a preset loss function with respect to the feature vectors is then determined and back-propagated to update the corresponding positions of the initialized model matrix. The sample abnormal sequence serving as the input is encoded by using the updated model matrix to obtain a coding vector for each numerical identification sample in the sample abnormal sequence, and the output sequence used for training the fault prediction model is subsequently obtained based on the generated coding vectors.

In practical applications, the coding vectors may be generated by the input encoding (Embedding) shown in fig. 3. Specifically, a matrix E may be randomly initialized before the fault prediction model is trained, where the row index of the matrix represents the type of abnormal event, and D, the size of the second dimension of the matrix E, is the dimension of the mapped feature vector, that is, the dimension after encoding. When the fault prediction model is trained, each numerical identification sample in the input sample abnormal sequence is mapped into a feature vector and fed into the model as the model input. During training, the loss gradient corresponding to each system abnormality is back-propagated to the corresponding position of the matrix E, which is then updated; the matrix E is a parameter of the model, and back-propagating the gradient of the loss function with respect to E updates the parameters of E and thereby trains the model. After the training of the fault prediction model is completed, the matrix E, as part of the model parameters, may be used to encode the test input, which may be the integer sequence described above: the coding vector corresponding to each event is obtained from the matrix E and the mapped integer. That is, the abnormal sequence serving as the input is encoded by using the updated model matrix, and a coding vector for each abnormality in the abnormal sequence is obtained.
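
The embedding lookup and the row-wise update of the matrix E can be sketched with NumPy as follows. The vocabulary size, the dimension D, the learning rate, the use of index 0 for the start token, and the helper names `embed` and `update_rows` are all assumptions for illustration; the gradient passed to `update_rows` would in practice come from back-propagation of the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EVENT_TYPES = 300  # hypothetical number of rows (abnormal event types)
D = 8                  # hypothetical embedding dimension

# Randomly initialized model matrix E: one row per abnormal event type.
E = rng.normal(scale=0.02, size=(NUM_EVENT_TYPES, D))

def embed(sequence, matrix):
    """Look up one row of the matrix per integer in the sequence."""
    return matrix[np.asarray(sequence)]

def update_rows(matrix, sequence, grad, lr=0.1):
    """Apply one gradient step, in place, to the rows used by the sequence.

    grad has shape (len(sequence), D). np.subtract.at accumulates correctly
    when the same integer appears more than once in the sequence.
    """
    np.subtract.at(matrix, np.asarray(sequence), lr * grad)

# 0 stands in for the [start] token in this sketch.
X = embed([0, 32, 91, 256], E)
```

Only the rows indexed by the input sequence receive a gradient, which is why the update touches the "corresponding positions" of E and leaves the other rows unchanged.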

Step 203, acquiring initialized weight matrices, and obtaining an output sequence according to the coding vectors and the weight matrices;

The process of deriving the output sequence used for training the fault prediction model from the generated coding vectors may be expressed as obtaining the output sequence from the coding vectors and the weight matrices.

In practical applications, this may be implemented by the linear complexity attention module (Linear Attention Block) shown in fig. 3. As shown in fig. 3, the model is composed of a plurality of Linear Attention Blocks connected in series, where the output of layer n-1 is used as the input of layer n. Accordingly, the coding vectors are used as the input of the first-layer linear complexity attention module and multiplied by the weight matrices to obtain the output of the first layer; then, from the second layer to the L-th layer, the sum of the input and the output of the previous layer is used as the input of the next layer and multiplied by the weight matrices to obtain the output of that layer, until the output of the top layer is obtained. The output of the top-layer (that is, the last-layer) linear complexity attention module of the fault prediction model is expressed as a matrix, which is equivalent to a special form of vector and can be converted into feature vectors; the feature vectors of all the row vectors in the output are combined into the output sequence.

In particular, assuming that the input sequence is the coding vector, the weight matrix can be represented by three randomly initialized weight matrices $W^Q$, $W^K$ and $W^V$, and the following calculation is performed:

$$Q = XW^Q$$

$$\alpha = \mathrm{Softmax}(Q)$$

$$X' = \mathrm{FFN}(K \odot \mathrm{GELU}(XW^V) + X)$$

Here, $W^Q$ is used to calculate the weight of each coding vector in the input sequence. X denotes the input, a matrix comprising a plurality of vectors; from the second layer of the linear complexity attention module onward, it is the sum of the input and the output of the previous layer. Q is obtained by multiplying X by $W^Q$, that is, Q is the product of the weight and the input item, and its value is taken as the dynamic weight. $\alpha$ is the result of applying the normalized exponential function Softmax to the matrix Q, which yields the corresponding feature vectors; $\alpha$ may be formed of the elements $\alpha_i$, that is, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_L]$, where L may represent the number of layers (Blocks) of linear complexity attention modules in the fault prediction model and $\alpha_i$ may represent the feature vector of the i-th layer. $W^K$ receives the calculated dynamic weights and is used to calculate the column weights of the respective feature vectors, and $W^V$ receives the calculated column weights and outputs to the next layer. FFN maps the vectors composed of the feature values of each dimension of all abnormal vectors in a sample. Softmax and GELU are two different activation functions. $\odot$ denotes the matrix-vector element-wise product, in which the vector K is multiplied element by element with each row of the matrix. X' denotes the output item of the current Block, that is, of the current linear complexity attention module.
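One possible reading of a single block can be sketched as follows. The text above does not give the formula for the vector K, so the sketch assumes K is the sum of the rows of $XW^K$ weighted by $\alpha$; it also assumes that $W^Q$ has a single output column, that Softmax is taken over the sequence positions, and that FFN is a single linear map `W_ffn`. All of these choices, and every function name, are assumptions made for illustration only.

```python
import numpy as np
from math import erf, sqrt

_erf = np.vectorize(erf)

def gelu(x):
    """Exact GELU activation: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + _erf(x / sqrt(2.0)))

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention_block(X, W_q, W_k, W_v, W_ffn):
    Q = X @ W_q                    # (T, 1): one dynamic weight per coding vector
    alpha = softmax(Q, axis=0)     # (T, 1): normalized over the sequence
    K = alpha.T @ (X @ W_k)        # (1, D): ASSUMED definition of the vector K
    H = K * gelu(X @ W_v) + X      # K is multiplied element-wise with every row
    return H @ W_ffn               # ASSUMED: FFN reduced to one linear map

rng = np.random.default_rng(0)
T, D = 6, 4                        # hypothetical sequence length and width
X = rng.normal(size=(T, D))
W_q = rng.normal(size=(D, 1))
W_k, W_v, W_ffn = (rng.normal(size=(D, D)) for _ in range(3))
out = linear_attention_block(X, W_q, W_k, W_v, W_ffn)
print(out.shape)
```

Under these assumptions no T-by-T correlation matrix is ever formed, so the cost of the block grows linearly with the sequence length T.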

And step 204, generating a fault prediction model according to the output sequence and the parameters of the classifier module.

In an embodiment of the application, the target feature vectors of the output item of the top layer, that is, the last layer, may be arranged as a sequence to form the output sequence. The target feature vector at the first position of the output sequence, that is, the feature vector corresponding to the start symbol, may then be obtained and input to the classifier module for classification, and the weighted cross entropy loss and the confidence used for training the fault prediction model are determined. A target loss function may then be generated according to the weighted cross entropy loss and the confidence of the fault prediction model, and the fault prediction model is obtained by training with the target loss function.

In practical applications, the feature vector (class token) corresponding to the start symbol of the output sequence of the last-layer TAABlock may be used as the representative of the sequence and sent to the Classifier shown in fig. 3 for classification. The start symbol does not correspond to any anomaly; however, the correlation between each anomaly and the start symbol is calculated when the correlation matrix is calculated, and the vector of the start symbol is weighted based on that correlation. The start symbol can therefore be regarded as a symbol that is independent of each individual anomaly yet represents the combination of all anomalies, which is why its feature vector is used for classification.
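Classification on the start-symbol vector can be sketched as follows. The classifier is assumed here to be a single logistic layer with hypothetical parameters `w` and `b`; the structure of the classifier in fig. 3 may differ.

```python
import numpy as np

def classify_sequence(output_sequence, w, b):
    """Return the confidence that the sequence is a downtime sample."""
    cls_vector = output_sequence[0]          # feature vector at the first position
    logit = float(cls_vector @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid maps the logit into (0, 1)

rng = np.random.default_rng(0)
output_sequence = rng.normal(size=(5, 4))    # hypothetical top-layer output, T=5, D=4
w = rng.normal(size=4)
b = 0.0
confidence = classify_sequence(output_sequence, w, b)
print(0.0 < confidence < 1.0)
```

Only the first row of the output sequence enters the classifier; the remaining rows influence the result solely through the attention layers that produced that row.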

Compared with the cross entropy loss commonly used in training, the weighted cross entropy loss can reduce the weight of the confusion sample and greatly increase the accuracy of the model.

Specifically, the obtained target feature vectors may be input to the classifier module for classification to obtain the confidence with which the classifier module classifies the sample corresponding to each feature vector in the output sequence as a downtime sample, and the fault downtime unit to which each such sample belongs is determined. When a sample of the same fault downtime unit appears in a subsequent output sequence, a downtime weight for system downtime occurring on that fault downtime unit is calculated from the confidence recorded for the unit and assigned to the sample. The weighted loss is then calculated using the sample of that fault downtime unit together with its downtime weight, which determines the weighted cross entropy loss. The confidence recorded for the fault downtime unit may be updated to the maximum confidence output, over all trained samples under that fault downtime unit, for classification as a downtime sample.

Illustratively, the weighted cross entropy loss is embodied as an NC-level loss used to train the classifier. For each input sample, the NC to which the sample belongs is recorded together with the confidence P with which the model classifies the sample as a positive sample. When a sample of the same NC appears later in training, a weight W can be set for the currently input sample according to the P value recorded for that NC, and the weighted cross entropy loss is calculated. At the same time, the P value recorded for that NC can be updated to the maximum confidence with which training samples under the NC have been classified as positive samples. In this way, the weighted loss is calculated from the P recorded for each NC and the samples. Because the loss is generated from the predictions of the multiple samples collected over the life cycle of a fault downtime unit, it can reduce the weight of confusing samples, bridge the difference between the training objects and the test objects, and reduce the negative influence of confusing samples on the model.

In practical applications, the weight W may be calculated from $P_{max}$ as follows:

where γ is a hyperparameter, $P_{max}$ is the maximum confidence, recorded for the corresponding NC, with which the samples of that NC have been predicted as positive samples by the model, and P is the output confidence with which the current sample is predicted as a positive sample by the model.

In a preferred embodiment, $P_{max}$ is updated after W has been calculated, so that the weight multiplied by the loss function of the corresponding sample is used for model updating during training, which improves the downtime prediction capability of the model. Specifically, the maximum confidence $P_{max}$ may be updated and attenuated as follows according to the confidence P with which the current sample is predicted as a positive sample by the model:

$$P_{max} \leftarrow \max(P_{max}, P)$$

$$P_{max} \leftarrow \lambda P_{max}, \quad \lambda \in (0, 1)$$

When the maximum confidence is updated, $\max(P_{max}, P)$ replaces $P_{max}$ with the confidence P whenever P is larger. When the maximum confidence is attenuated, $P_{max}$ is multiplied by the factor λ; because λ lies between 0 and 1, that is, it is a fraction, the product of λ and $P_{max}$ is smaller than $P_{max}$, which achieves the attenuation. It should be noted that the embodiment of the present application does not limit the specific value of the factor λ.
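The per-NC bookkeeping described above can be sketched as follows. The formula for W is not reproduced in the text, so `sample_weight` below is a purely hypothetical stand-in that lowers the weight of a sample whose confidence falls far below the value recorded for its NC. Applying both the update and the attenuation after every sample is likewise only one possible reading, and the class name `NCLossTracker` and the default values of `gamma` and `lam` are assumptions.

```python
import math

def sample_weight(p_max, p, gamma):
    """HYPOTHETICAL weight; the formula of the application is not shown in the text."""
    return (1.0 - max(p_max - p, 0.0)) ** gamma

class NCLossTracker:
    def __init__(self, gamma=2.0, lam=0.9):
        self.gamma = gamma      # hyperparameter of the weight
        self.lam = lam          # attenuation factor, 0 < lam < 1
        self.p_max = {}         # NC identifier -> recorded maximum confidence

    def loss(self, nc_id, p, label):
        """Weighted cross entropy for one sample with positive-class confidence p."""
        p_max = self.p_max.get(nc_id, p)          # first sample of an NC: no record yet
        w = sample_weight(p_max, p, self.gamma)   # W is calculated before the update
        eps = 1e-12
        ce = -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))
        p_max = max(p_max, p)                     # P_max <- max(P_max, P)
        self.p_max[nc_id] = self.lam * p_max      # P_max <- lambda * P_max
        return w * ce

tracker = NCLossTracker(gamma=2.0, lam=0.9)
first = tracker.loss("nc-1", p=0.8, label=1)
second = tracker.loss("nc-1", p=0.2, label=1)
print(second < -math.log(0.2))
```

In this sketch the second sample of `nc-1` receives a weight below 1 because its confidence is well under the value recorded for the same NC, so its contribution to the loss is smaller than its unweighted cross entropy.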

In the final classification, a preset threshold value may be set, when the prediction confidence coefficient output by the model is greater than the preset threshold value, the fault prediction result may be determined as a downtime result, and when the output prediction confidence coefficient is less than or equal to the preset threshold value, the fault prediction result may be determined as a non-downtime result.
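The threshold rule can be written as a short sketch; the function name `predict` and the default threshold of 0.5 are hypothetical, since the text does not fix the value of the preset threshold.

```python
def predict(confidence, threshold=0.5):
    """Return the fault prediction result for one prediction confidence."""
    # Strictly greater than the threshold means downtime; equal or less does not.
    return "downtime" if confidence > threshold else "non-downtime"

print(predict(0.73))
print(predict(0.5))
```

Note that a confidence exactly equal to the threshold is treated as a non-downtime result, matching the "less than or equal to" wording above.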

The update procedure for the maximum confidence described above exists only in the training process; no confidence update is required in the prediction process that uses the trained fault prediction model. Because the model may change at each iteration of training, $P_{max}$ is updated continuously. For the same reason, the confidence that the changed model would output for the samples behind a previously recorded $P_{max}$ may have become smaller, which is why the attenuation is also applied.

At this point, the parameters of the fault prediction model can be updated by back-propagating the gradient of the weighted cross entropy loss through the target loss function, thereby generating the fault prediction model.

In the embodiment of the application, a linear Attention calculation is adopted to reduce the computational complexity of the model, and an NC-level loss function is adopted to bridge the difference between the training objects and the test objects, which reduces the negative influence of confusing samples on the model and improves the accuracy of the system fault prediction result. Specifically, the linear complexity Attention replaces the Attention mechanism commonly used in BERT, which reduces the computational complexity of the model, improves its efficiency and stability, and allows more data to be processed simultaneously in online prediction. The NC-level loss function is adopted because multiple samples collected over an NC life cycle are used in training whereas the final prediction object is the NC itself; the NC-level loss is designed to bridge this difference between the training objects and the test objects and to reduce the negative influence of confusing samples on the model. Furthermore, the log text is abstracted and a corresponding abnormal event library is constructed to encode the logs. Encoding based on an independent abnormal event library avoids the complex process of pre-training a BERT model to encode the log text and reduces the computational complexity of the model, so that the trained model is smaller, trains and tests faster, and is more efficient.

Referring to fig. 4, a schematic diagram of an application scenario of weighted loss based system failure prediction provided in an embodiment of the present application is shown, where the weighted loss based system failure prediction may be applied to a cloud computing system, and relates to a cloud computing platform 410 and a cloud computing server cluster 411.

In practical applications, the cloud computing platform 410 may preprocess the offline data set collected from the cloud computing server cluster 411 so that it can serve as a training set for training the model. The abnormal logs included in the offline data set are stored and processed through the steps of mapping abnormal logs to abnormal events, converting the abnormal events into numerical identifiers, converting the abnormal grades into numerical identifiers, constructing the abnormal event library, forming abnormal identification sequences, and resampling according to a preset time interval and a preset sampling window length to generate abnormal sequences. The fault prediction model may then be trained and generated according to the training framework shown in fig. 3, so that the generated fault prediction model is capable of executing a Transformer algorithm based on NC weighted loss and an attention mechanism of linear computational complexity.

During fault prediction, the cloud computing platform 410 can obtain the abnormal event sequence at a given moment and then, based on the obtained fault prediction model, execute the Transformer algorithm based on NC weighted loss and an attention mechanism of linear computational complexity to predict whether the system will fail. Based on the prediction result, it determines whether to perform operation and maintenance on the downtime fault in advance, so that users do not perceive the downtime and user experience is optimized.

Specifically, NC downtime is one of the important factors affecting the stability of the cloud computing system, and an unexpected sudden downtime causes serious loss to users. To maintain the stability of the cloud computing system, the fault prediction model may infer from the input real-time system data whether the current NC will go down in the coming period of time and output a fault prediction result. In a specific implementation, the fault prediction result may be determined from the output prediction confidence: if the prediction confidence is greater than a preset threshold, the fault prediction result for the system is determined to be a downtime result, and/or if the prediction confidence is less than or equal to the preset threshold, the fault prediction result for the system is determined to be a non-downtime result. A downtime result means that the NC is judged to be at risk of downtime in the coming period of time, in which case operation and maintenance can be performed on the downtime fault in advance so that users do not perceive the downtime. A non-downtime result means that the NC is not at risk of downtime in the coming period of time, in which case no operation is performed for the time being.

In the embodiment of the application, the NCLAT algorithm can extract complex feature expression patterns with low computational complexity while reducing the negative influence of confusing samples on the model, so that NCs about to go down are found efficiently, accurately and comprehensively, unnecessary loss is avoided through operation and maintenance, and the stability of the cloud computing system is greatly improved.

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those of skill in the art will recognize that the embodiments described in this specification are presently preferred embodiments and that no particular act is required to implement the embodiments of the disclosure.

Referring to fig. 5, a block diagram of a structure of an embodiment of the weighted loss based system failure prediction apparatus according to the present application is shown, and is applied to a cloud computing system, and specifically may include the following modules:

an abnormal sequence generating module 501, configured to obtain real-time system data acquired in real time, and generate an abnormal sequence according to the real-time system data acquired in real time;

a failure prediction model obtaining module 502, configured to obtain a failure prediction model; the fault prediction model is generated based on target loss function training, and the target loss function is obtained based on weighted cross entropy loss and confidence coefficient calculation of the fault prediction model;

and a failure prediction module 503, configured to obtain a failure prediction result for the cloud computing system according to the abnormal sequence and the failure prediction model.

In an embodiment of the present application, the real-time system data collected in real time includes an abnormal log reported by each fault downtime unit in the cloud computing system for a system abnormality; the abnormal sequence generation module 501 may include the following sub-modules:

the abnormal event library construction submodule is used for acquiring a reporting sequence of the abnormal logs reported by each fault downtime unit and mapping the abnormal logs reported by each fault downtime unit aiming at the system abnormity to obtain abnormal events, wherein each abnormal event has a corresponding numerical identifier in the abnormal event library, and each abnormal event has a corresponding abnormal grade;

In an embodiment of the present application, the abnormal identification sequences include downtime samples and non-downtime samples, where the downtime samples are numerical identification samples corresponding to system downtime occurring in the cloud computing system, and the non-downtime samples are numerical identification samples corresponding to other system abnormalities occurring in the cloud computing system except for the system downtime; the abnormal sequence generation submodule may include the following elements:

and the abnormal sequence generating unit is used for reserving the downtime samples within the length of a preset sampling window from the sampling initial position, and randomly reserving the non-downtime samples on the abnormal identification sequence according to a preset proportion from the sampling initial position to obtain an abnormal sequence aiming at the system downtime.
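The retention rule of the abnormal sequence generating unit can be sketched as follows. The sketch assumes the sampling window length is counted in sequence positions and that a Boolean flag marks each downtime sample; the function name `sample_abnormal_sequence` and its parameters are hypothetical.

```python
import random

def sample_abnormal_sequence(ids, is_downtime, start, window, keep_ratio, seed=0):
    """Keep downtime samples inside the window and a random share of the others."""
    rng = random.Random(seed)
    kept = []
    for pos in range(start, len(ids)):
        if is_downtime[pos]:
            if pos < start + window:          # downtime sample within the window
                kept.append(ids[pos])
        elif rng.random() < keep_ratio:       # non-downtime sample kept at random
            kept.append(ids[pos])
    return kept

ids = [1, 2, 9, 3, 9, 4]                      # hypothetical numerical identifiers
is_downtime = [False, False, True, False, True, False]
print(sample_abnormal_sequence(ids, is_downtime, start=0, window=3, keep_ratio=1.0))
```

With `keep_ratio=1.0` every non-downtime sample is retained and only the downtime sample outside the window is dropped; lowering the ratio thins out the non-downtime samples at random.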

In one embodiment of the present application, the failure prediction module 503 may include the following sub-modules:

the confidence coefficient output sub-module is used for inputting the abnormal sequence serving as an input item to the fault prediction model and outputting to obtain a prediction confidence coefficient;

In one embodiment of the present application, the failure prediction model has a classifier module for instructing the failure prediction module to determine a failure prediction result of the cloud computing system; the apparatus may further include the following modules:

In one embodiment of the present application, the fault prediction model generation module may include the following sub-modules:

the coding submodule is used for initializing a preset matrix randomly to obtain a model matrix, and coding the sample abnormal sequence by adopting the model matrix to obtain a coding vector;

In an embodiment of the present application, the sample exception sequence includes each numerical identification sample corresponding to a system exception; the encoding submodule may include the following elements:

In one embodiment of the present application, the fault prediction model has a multi-layer linear complexity attention module, and the linear complexity attention module is configured to instruct the fault prediction model to perform system fault detection based on a weighted loss; the output sequence generation submodule may include the following units:

In one embodiment of the present application, the failure prediction model generation sub-module may include the following units:

In one embodiment of the present application, the weighted cross-entropy loss determination unit may include the following sub-units:

and the confidence updating subunit is configured to perform the weighted loss calculation using the samples of the same fault downtime unit together with the downtime weight assigned to them, determine the weighted cross entropy loss, and update the confidence recorded for that fault downtime unit to the maximum confidence output, over all trained samples under the fault downtime unit, for classification as a downtime sample.

In one embodiment of the present application, the failure prediction model generation unit may include the following sub-units:

With the weighted loss based system fault prediction device provided by the embodiment of the application, when fault prediction is performed on real-time system data collected in real time, an abnormal sequence can be generated from that data, and fault prediction is then performed on the abnormal sequence with the obtained fault prediction model to obtain a fault prediction result. The fault prediction model can be generated by training with a target loss function, and the target loss function can be obtained from the weighted cross entropy loss and the confidence of the fault prediction model. Because the weighted cross entropy loss is generated from the predictions of multiple samples collected over the life cycle of a fault downtime unit, it can reduce the weight of confusing samples and bridge the difference between the training objects and the test objects, which reduces the negative influence of confusing samples on the model and improves the accuracy of model prediction.

For the device embodiment, since it is basically similar to the method embodiment, the description is relatively brief; for relevant points, refer to the corresponding description of the method embodiment.

An embodiment of the present application further provides an electronic device, including:

a processor, a memory and a computer program that is stored in the memory and can run on the processor, wherein the computer program, when executed by the processor, implements each process of the above embodiment of the weighted loss based system fault prediction method and can achieve the same technical effect, which is not described here again to avoid repetition.

The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the embodiment of the system fault prediction method based on weighting loss, and can achieve the same technical effect, and is not described herein again to avoid repetition.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.

The weighted loss-based system fault prediction method, the weighted loss-based system fault prediction device, the corresponding electronic device and the corresponding computer-readable storage medium provided by the present application are described in detail above, and specific examples are applied in the present application to explain the principles and embodiments of the present application, and the description of the above embodiments is only used to help understand the method and the core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A system fault prediction method based on weighted loss is applied to a cloud computing system, and comprises the following steps:

2. The method according to claim 1, wherein the real-time system data collected in real time comprises an abnormal log reported by each fault downtime unit in the cloud computing system for system abnormality; the generating of the abnormal sequence according to the real-time system data collected in real time comprises:

forming an abnormal identification sequence by the abnormal numerical identification aiming at each abnormal log according to the reverse sequence of the reporting sequence;

3. The method according to claim 2, wherein the abnormal identification sequences comprise downtime samples and non-downtime samples, wherein the downtime samples are numerical identification samples corresponding to system downtime occurring on the cloud computing system, and the non-downtime samples are numerical identification samples corresponding to other system abnormalities occurring on the cloud computing system except for the system downtime;

sampling the abnormal identification sequence according to a preset time interval and a preset sampling window length to obtain an abnormal sequence, comprising:

4. The method of claim 1, wherein obtaining a fault prediction result for the system based on the abnormal sequence and the fault prediction model comprises:

if the prediction confidence is larger than a preset threshold, determining that the fault prediction result aiming at the system is a downtime result;

5. The method of claim 1 or 4, wherein the fault prediction model has a classifier module for instructing the fault prediction module to determine a fault prediction result for the cloud computing system; the fault prediction model is generated by the following method:

6. The method of claim 5, wherein the sample exception sequence includes numerical identification samples corresponding to system exceptions; the encoding of the sample abnormal sequence by using the model matrix to obtain an encoding vector comprises:

obtaining each numerical identification sample corresponding to each system abnormality in the abnormality sequence, and mapping and converting each numerical identification sample corresponding to each system abnormality in the abnormality sequence into a corresponding feature vector;

7. The method of claim 5, wherein the fault prediction model has a multi-layered linear complexity attention module for instructing the fault prediction model to perform system fault detection based on weighted loss; the obtaining an output sequence according to the coding vector and the weight matrix includes:

8. The method of claim 5, wherein generating a fault prediction model based on the output sequence and the parameters of the classifier module comprises:

9. The method of claim 8, wherein inputting the obtained target feature vectors into the classifier module for classification, and determining a weighted cross-entropy loss and a confidence level for training the fault prediction model comprises:

inputting the obtained target feature vectors into the classifier module for classification, and obtaining the confidence coefficient that the classifier module classifies each sample corresponding to each feature vector in the output sequence as a downtime sample;

10. The method of claim 8, wherein the generating an objective loss function according to the weighted cross entropy loss and the confidence of the fault prediction model, and training with the objective loss function to obtain the fault prediction model comprises:

11. A system failure prediction device based on weighting loss is applied to a cloud computing system, and comprises the following components:

the abnormal sequence generation module is used for acquiring real-time system data acquired in real time and generating an abnormal sequence according to the real-time system data acquired in real time;

12. An electronic device, comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the weighted loss based system failure prediction method of any of claims 1-10.

13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the weighted loss based system failure prediction method according to any one of claims 1 to 10.