CN115686916A

CN115686916A - Intelligent operation and maintenance method and device

Info

Publication number: CN115686916A
Application number: CN202211379266.2A
Authority: CN
Inventors: 徐林嘉; 陈李龙; 袁如怡; 李睿琦
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2023-02-03

Abstract

The application provides an intelligent operation and maintenance method and device, which can be used in the financial field or other fields. The method comprises: acquiring real-time operation data collected by an operation and maintenance system; analyzing the real-time operation data with a fault detection model constructed based on a sample balance loss technique to obtain a fault detection result; and obtaining an operation and maintenance strategy according to the fault detection result, and performing corresponding operation and maintenance processing according to that strategy. Building the fault detection model in this way addresses the sample imbalance problem, so that the model detects abnormal samples better even when they make up an extremely small proportion of the data. Intelligent fault early warning and detection are thereby achieved: anomalies are monitored and alarms are raised while the impact of a fault is still small, or even before the fault occurs, so that operation and maintenance work proceeds more smoothly, the manpower and material resources it requires are reduced, and its effect and efficiency are improved.

Description

Intelligent operation and maintenance method and device

Technical Field

The present application relates to the field of system operation and maintenance technologies, and in particular, to an intelligent operation and maintenance method and apparatus.

Background

IT operation and maintenance management is currently one of the most discussed topics in the information field. As IT construction deepens and matures, the operation and maintenance of computer hardware and software systems has become a widespread concern and a burden for information service departments in every industry, driven by the rapid growth of data volume, the continuous rise in the number of devices, and the increasing complexity of software and hardware systems. In the early days most operation and maintenance work was completed by hand by operation and maintenance personnel, which is called manual operation and maintenance, or colloquially "human-powered" operation and maintenance; this outdated mode of production is difficult to sustain in an era of rapidly expanding internet business and high labor costs. Meanwhile, the traditional operation and maintenance mode places an ever heavier workload on operation and maintenance engineers, and large enterprises require ever more manpower and material resources for operation and maintenance.

Generally, the fault detection function in an AIOPS (Artificial Intelligence for IT Operations) scenario is a classification task. In the data set of a classification task, if the numbers of training examples from different classes differ greatly, the data set is considered class-imbalanced. The imbalance ratio varies from task to task, ranging from less than ten to several thousand. Classification tasks based on such data are collectively referred to as sample-imbalanced classification.

In many practical tasks, the data is large in scale and contains a lot of noise, and the numbers of samples of different classes vary greatly: minority-class samples make up only a small proportion compared with majority-class samples. Take a real task as an example. In a click-through rate prediction task, a new sample is generated each time an advertisement is presented to a user, and whether the user ultimately clicks determines the label of that sample. In practice only a few users click on an advertisement embedded in a web page, which results in a large difference between the numbers of positive and negative examples in the final training data set. The same situation occurs in many practical application scenarios, such as financial fraud detection (normal/fraud), medically assisted diagnosis (normal/sick), network intrusion detection (normal/attack connection), and so on.

Fault detection (normal/fault) in the AIOPS scenario likewise involves a large difference between the numbers of samples of different classes. First, faults occur far less often than normal operation in a real operation and maintenance scenario, so the numbers of positive and negative examples in the resulting training data set differ greatly. Second, the model attends mainly to the prediction accuracy over all samples during training. Together, these two factors mean that the trained model detects the minority-class samples less well than desired.
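The effect of optimizing overall accuracy on an imbalanced data set can be illustrated with a small worked example. The counts below (990 normal samples, 10 fault samples) and the trivial always-normal predictor are assumptions chosen for illustration; they do not come from the application.

```python
# Hypothetical data set: 990 normal samples (label 0) and 10 fault samples (label 1).
labels = [0] * 990 + [1] * 10

# A trivial "model" that always predicts normal.
predictions = [0] * len(labels)

# Overall accuracy looks excellent because normal samples dominate.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall on the fault class shows that no fault is ever detected.
faults = sum(labels)
detected_faults = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fault_recall = detected_faults / faults

print(f"accuracy={accuracy:.3f}, fault recall={fault_recall:.3f}")
```

The predictor is right on every normal sample and wrong on every fault, so a high accuracy coexists with a fault recall of zero.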

Disclosure of Invention

Aiming at the problems in the prior art, the embodiment of the application mainly aims to provide an intelligent operation and maintenance method and device, so that accurate intelligent operation and maintenance are realized, and the operation and maintenance effect and efficiency are improved.

In order to achieve the above object, the present application provides an intelligent operation and maintenance method, including: acquiring real-time operation data acquired by an operation and maintenance system; analyzing the real-time operation data through a fault detection model constructed based on a sample balance loss technology to obtain a fault detection result; and obtaining an operation and maintenance strategy according to the fault detection result, and performing corresponding operation and maintenance processing according to the operation and maintenance strategy.

Optionally, the method further includes: obtaining historical system operation data, and preprocessing it to generate training sample data; sampling the training sample data through a preset loss function to obtain effective samples in which positive and negative samples are balanced; and constructing the fault detection model through a tree-type algorithm according to the loss function and the effective samples.

Optionally, sampling the training sample data through a preset loss function to obtain effective samples in which positive and negative samples are balanced includes: defining, through the preset loss function, a class balance term that ranges between no re-weighting and inverse class frequency weighting; and sampling the training sample data according to the class balance term to obtain the effective samples in which positive and negative samples are balanced.

Optionally, constructing the fault detection model according to the loss function and the effective sample through a tree type algorithm includes: determining the number of effective samples corresponding to different types of samples according to the number of samples in the effective samples; updating parameters of a preset loss function according to the number of effective samples corresponding to different types of samples; and constructing and obtaining the fault detection model through a tree type algorithm according to the effective samples in the training sample data and the loss function after the parameters are updated.

Optionally, constructing the fault detection model through a tree-type algorithm according to the effective samples in the training sample data and the loss function with updated parameters includes: determining the negative gradient value of the loss function for each effective sample by using a boosting tree algorithm and the loss function with updated parameters; traversing the effective samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain constraint parameters; and adjusting an initial gradient boosting regression model according to the constraint parameters to construct the fault detection model.

Optionally, traversing the effective samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain constraint parameters includes: traversing the samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain the minimum value of the loss function; and performing residual fitting according to the minimum value of the loss function to obtain the constraint parameters.

Optionally, adjusting the initial gradient boosting regression model according to the constraint parameters to construct the fault detection model includes: training the gradient boosting regression model of the next round according to the constraint parameters calculated by the gradient boosting regression model of each round; and obtaining the fault detection model from the initial gradient boosting regression model after a preset number of training iterations.

The application also provides an intelligent operation and maintenance device, including: an acquisition module for acquiring real-time operation data collected by the operation and maintenance system; an analysis module for analyzing the real-time operation data with a fault detection model constructed based on a sample balance loss technique to obtain a fault detection result; and a processing module for obtaining an operation and maintenance strategy according to the fault detection result and performing corresponding operation and maintenance processing according to that strategy.

The application also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the method.

The present application also provides a computer-readable storage medium having stored thereon a computer program for executing the above method.

The present application also provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the above-described method.

Building the fault detection model in this way addresses the sample imbalance problem, so that the model detects abnormal samples better even when they make up an extremely small proportion of the data. Intelligent fault early warning and detection are thereby achieved: anomalies are monitored and alarms are raised while the impact of a fault is still small, or even before the fault occurs, so that operation and maintenance work proceeds more smoothly, the manpower and material resources it requires are reduced, and its effect and efficiency are improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1A is a flowchart of an intelligent operation and maintenance method according to an embodiment of the present application;

FIG. 1B is a flowchart of fault detection model training according to an embodiment of the present application;

FIG. 2 is a flowchart of effective sample sampling in an embodiment of the present application;

FIG. 3 is a flowchart of the construction of a fault detection model in an embodiment of the present application;

FIG. 4 is a schematic diagram of the update flow of a loss function in an embodiment of the present application;

FIG. 5 is a flowchart of constraint parameter acquisition in an embodiment of the present application;

FIG. 6 is a flowchart of the iterative training process of a fault detection model in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an intelligent operation and maintenance device in an embodiment of the present application;

FIG. 8 is a schematic diagram of the application logic of the intelligent operation and maintenance device in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an intelligent operation and maintenance device in another embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The embodiments of the application provide an intelligent operation and maintenance method and device. It should be noted that the method and device can be used in the financial field and also in any field other than the financial field.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1A shows a flowchart of an intelligent operation and maintenance method according to an embodiment of the present application. The execution subject of the method includes, but is not limited to, a computer. Building the fault detection model addresses the sample imbalance problem, so that the model detects abnormal samples better even when they make up an extremely small proportion of the data. Intelligent fault early warning and detection are thereby achieved: anomalies are monitored and alarms are raised while the impact of a fault is still small, or even before the fault occurs, so that operation and maintenance work proceeds more smoothly, the manpower and material resources it requires are reduced, and its effect and efficiency are improved. The method shown in FIG. 1A includes:

S1: acquiring real-time operation data collected by the operation and maintenance system.

As an embodiment of the present application, the real-time operational data and the historical system operational data for subsequent use may include: system success rate, CPU utilization rate, memory utilization rate, service response time, service time consumption, network rate and monitoring messages.

Various historical system operation data are collected during operation of the IT system, including system-level information (CPU utilization, memory utilization, network speed and the like) and application-level information (transaction success rate, transaction response time and the like), together with the large accumulated volume of original alarm information and the corresponding manual alarm labels. After collection, the historical system operation data are used as training sample data for the model.

As an embodiment of the present application, the method further comprises: preprocessing historical system operation data; wherein, the preprocessing comprises noise filtering, data cleaning and data analysis.

Both the real-time operation data and the historical system operation data are preprocessed; the preprocessing includes noise filtering, data cleaning, data correlation analysis, and dimensionality reduction by principal component analysis (PCA), among other steps.
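The application does not specify how each preprocessing step is carried out. A minimal sketch of two of them, assuming a median filter for noise filtering and removal of incomplete records for data cleaning, could look as follows; the function names, the metric names and the sample values are all hypothetical. PCA is left out to keep the example dependency-free, and the standardization step is shown only because it commonly precedes PCA.

```python
from statistics import mean, median, pstdev

def clean(records):
    """Data cleaning: drop records in which any metric value is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]

def median_filter(values, window=3):
    """Noise filtering: replace each value by the median of its neighbourhood."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def standardize(values):
    """Scale a metric to zero mean and unit variance (a usual step before PCA)."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

# Hypothetical raw monitoring records.
raw = [
    {"cpu": 0.31, "mem": 0.52},
    {"cpu": 0.30, "mem": None},   # incomplete record, dropped by clean()
    {"cpu": 0.32, "mem": 0.53},
    {"cpu": 0.98, "mem": 0.53},   # isolated spike, smoothed by median_filter()
    {"cpu": 0.33, "mem": 0.55},
    {"cpu": 0.32, "mem": 0.54},
]

rows = clean(raw)
cpu = median_filter([r["cpu"] for r in rows])
cpu_scaled = standardize(cpu)
```

A median filter is used here because it removes isolated spikes without being pulled toward them; whether a spike is noise or an early fault signal is a judgment the real system would have to make.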

S2: analyzing the real-time operation data with a fault detection model constructed based on a sample balance loss technique to obtain a fault detection result.

S3: obtaining an operation and maintenance strategy according to the fault detection result, and performing corresponding operation and maintenance processing according to the operation and maintenance strategy.

Because faults occur far less often than normal operation in the intelligent operation and maintenance scenario, the numbers of positive and negative examples in the resulting training data set differ greatly. A learning technique designed for sample imbalance is therefore adopted to improve the model's detection of data from the rare fault scenarios, and intelligent fault early warning/detection is implemented with a LightGBM classifier. In step S3, the fault detection result includes the current fault detection result of the system and may also include a prediction of future system faults; based on this result, the corresponding operation and maintenance processing is performed in combination with a preset operation and maintenance strategy. This process may be configured by a person skilled in the art according to actual needs and is not further limited here.
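Steps S1 to S3 can be pictured as a short pipeline. The sketch below is only an illustration under stated assumptions: the strategy table, the metric names, the thresholds and the `toy_detector` stand-in for the trained model are all hypothetical, since the application leaves the strategy configuration to the practitioner.

```python
from typing import Callable, Dict, List

# Hypothetical strategy table: detection result -> ordered operation and maintenance actions.
STRATEGIES: Dict[str, List[str]] = {
    "normal": [],
    "warning": ["display_alert", "record_event"],
    "fault": ["display_alert", "send_sms", "rate_limit_service"],
}

def acquire_metrics() -> Dict[str, float]:
    """S1: stand-in for reading real-time data from the monitoring system."""
    return {"cpu": 0.97, "mem": 0.88, "success_rate": 0.71}

def toy_detector(metrics: Dict[str, float]) -> float:
    """Placeholder for the trained fault detection model; returns a score in [0, 1]."""
    return max(metrics["cpu"], 1.0 - metrics["success_rate"])

def detect(metrics: Dict[str, float], model: Callable[[Dict[str, float]], float]) -> str:
    """S2: turn the model score into a detection result (thresholds are illustrative)."""
    score = model(metrics)
    if score >= 0.9:
        return "fault"
    if score >= 0.7:
        return "warning"
    return "normal"

def run_once(model: Callable[[Dict[str, float]], float] = toy_detector) -> List[str]:
    """S3: look up the strategy for the detection result and return its actions."""
    result = detect(acquire_metrics(), model)
    return STRATEGIES[result]

actions = run_once()
print(actions)
```

Keeping the strategy in a lookup table separates the detection model from the response policy, so either can be changed without touching the other.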

As an embodiment of the present application, as shown in fig. 1B, a method for constructing a fault detection model provided by the present application specifically includes:

S4: acquiring historical system operation data, and preprocessing the historical system operation data to generate training sample data;

The preprocessing has been described in detail in the foregoing embodiments and is not repeated here.

S5: sampling the training sample data through a preset loss function to obtain effective samples in which positive and negative samples are balanced;

S6: constructing the fault detection model through a tree-type algorithm according to the loss function and the effective samples.

As an embodiment of the present application, as shown in fig. 2, the obtaining an effective sample with balanced positive and negative samples by sampling the training sample data through a preset loss function includes:

Step S21: defining, through the preset loss function, a class balance term that ranges between no re-weighting and inverse class frequency weighting;

Step S22: sampling the training sample data according to the class balance term to obtain effective samples in which positive and negative samples are balanced.

In the above embodiments, the learning technique based on the sample imbalance is implemented by defining a special predetermined loss function (loss) calculated as shown in formula (1).
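Formula (1) is not reproduced in this text. A form consistent with the definitions given in this description (the effective number $E_n = (1-\beta^n)/(1-\beta)$ and the per-class sample count $n_y$) would be the following, where $\mathcal{L}(p, y)$ is an assumed symbol for the ordinary loss on the predicted probabilities $p$ and the label $y$:

```latex
\mathrm{CB}(p, y) = \frac{1}{E_{n_y}}\,\mathcal{L}(p, y)
                  = \frac{1-\beta}{1-\beta^{n_y}}\,\mathcal{L}(p, y) \tag{1}
```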

Where $n_y$ is the number of effective samples of class $y$. $\beta = 0$ corresponds to no re-weighting, and $\beta \to 1$ corresponds to re-weighting by inverse class frequency. In the loss function, the balance between positive and negative samples is achieved by using $\beta$ as the class balance term; the specific implementation and principle are described in detail in the following embodiments and are not detailed here.
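The two limiting cases of $\beta$ can be checked numerically. The sketch below assumes the per-class weight takes the form $(1-\beta)/(1-\beta^{n_y})$, which matches the effective-number definition in this description; the function name and the class counts are hypothetical.

```python
def class_balance_weight(n_y: int, beta: float) -> float:
    """Weight (1 - beta) / (1 - beta ** n_y) for a class with n_y samples, 0 <= beta < 1."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if n_y < 1:
        raise ValueError("n_y must be at least 1")
    return (1.0 - beta) / (1.0 - beta ** n_y)

# Hypothetical class counts: many normal samples, few fault samples.
counts = {"normal": 10_000, "fault": 50}

for beta in (0.0, 0.9, 0.999, 0.999999):
    weights = {c: class_balance_weight(n, beta) for c, n in counts.items()}
    ratio = weights["fault"] / weights["normal"]
    print(f"beta={beta}: fault/normal weight ratio = {ratio:.2f}")
```

With $\beta = 0$ both classes receive weight 1, so the ratio is 1. As $\beta$ approaches 1 each weight approaches $1/n_y$, so the ratio approaches the inverse frequency ratio $10000/50 = 200$.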

In an embodiment of the present application, as shown in fig. 3, constructing the fault detection model according to the loss function and the valid samples by a tree type algorithm includes:

Step S31: determining the number of effective samples corresponding to each class of samples according to the number of samples in the effective samples;

Step S32: updating the parameters of the preset loss function according to the number of effective samples corresponding to each class of samples;

Step S33: constructing the fault detection model through a tree-type algorithm according to the effective samples in the training sample data and the loss function with updated parameters.

As shown in fig. 4, the step S33 of constructing and obtaining the fault detection model through a tree type algorithm according to the effective samples in the training sample data and the loss function after parameter update may include:

S41: determining the negative gradient value of the loss function for each effective sample by using a boosting tree algorithm and the loss function with updated parameters;

S42: traversing the effective samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain constraint parameters;

S43: adjusting an initial gradient boosting regression model according to the constraint parameters to construct the fault detection model.
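The application does not give a closed form for the negative gradient in step S41. As a minimal sketch, assume a binary (normal/fault) task with a class-weighted logistic loss, which is the two-class special case of a weighted softmax cross entropy; the negative gradient with respect to the raw score is then the class weight times the difference between label and predicted probability. The function names and the weight values below are hypothetical.

```python
import math

def sigmoid(score: float) -> float:
    """Map a raw score to a fault probability."""
    return 1.0 / (1.0 + math.exp(-score))

def negative_gradients(scores, labels, class_weights):
    """Per-sample negative gradient of the weighted logistic loss w.r.t. the raw score.

    loss_i = -w[y_i] * (y_i * log(p_i) + (1 - y_i) * log(1 - p_i)), p_i = sigmoid(score_i)
    -d loss_i / d score_i = w[y_i] * (y_i - p_i)
    """
    return [class_weights[y] * (y - sigmoid(s)) for s, y in zip(scores, labels)]

# Hypothetical class weights (for example from a class balance term); 1 = fault, 0 = normal.
class_weights = {0: 0.01, 1: 1.0}
scores = [0.0, 0.0, 2.0]   # current raw model outputs for three samples
labels = [0, 1, 1]

residuals = negative_gradients(scores, labels, class_weights)
print(residuals)
```

Because the class weight multiplies the gradient, a misclassified fault sample pulls on the next tree far more strongly than a misclassified normal sample. Gradient boosting libraries that accept a custom objective typically also ask for the second derivative, which for this loss would be the class weight times $p(1-p)$.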

Specifically, $E_n$ denotes the effective number of samples. For simplicity, the present application assumes that a newly sampled data point can interact with the previously sampled data in only two ways: it lies entirely inside the set of previously sampled data with probability $p$, or entirely outside it with probability $1-p$. Accordingly, the effective number is defined as $E_n = (1-\beta^n)/(1-\beta)$, where $\beta = (N-1)/N$.

Specifically, assume that $n-1$ samples have already been drawn and the $n$-th sample is about to be drawn. The expected volume of the previously sampled data is then $E_{n-1}$, and the probability that the newly sampled data point overlaps the previously sampled data is $p = E_{n-1}/N$. The expected volume after the $n$-th sample is therefore as shown in equation (2).
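Equation (2) is not reproduced in this text. Taking the overlap probability to be $p = E_{n-1}/N$, a form that agrees with equation (3) and with $\beta = (N-1)/N$ is given here as a reconstruction rather than the original typesetting: with probability $p$ the new point adds nothing, and with probability $1-p$ it adds one unit of volume.

```latex
E_n = p\,E_{n-1} + (1-p)\,(E_{n-1}+1) = 1 + \frac{N-1}{N}\,E_{n-1} \tag{2}
```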

At this time:

$$E_{n-1} = \frac{1-\beta^{n-1}}{1-\beta} \qquad (3)$$

It then follows that:
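Equation (4) is not reproduced in this text. Substituting equation (3) into the recursion $E_n = 1 + \beta E_{n-1}$, with $\beta = (N-1)/N$, gives the following reconstruction:

```latex
E_n = 1 + \beta\,\frac{1-\beta^{n-1}}{1-\beta}
    = \frac{1-\beta+\beta-\beta^{n}}{1-\beta}
    = \frac{1-\beta^{n}}{1-\beta} \tag{4}
```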

This result shows that the effective number of samples is an exponential function of $n$, and the hyperparameter $\beta \in [0, 1)$ controls how fast $E_n$ grows with $n$.
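The closed form and the step-by-step recursion can be compared numerically. In this sketch the value of $N$ is a hypothetical choice, since the text does not give one, and the function names are illustrative.

```python
def effective_number(n: int, beta: float) -> float:
    """Closed form E_n = (1 - beta ** n) / (1 - beta), for 0 <= beta < 1."""
    return (1.0 - beta ** n) / (1.0 - beta)

def effective_number_recursive(n: int, beta: float) -> float:
    """Same quantity via E_1 = 1 and E_k = 1 + beta * E_(k-1)."""
    e = 1.0
    for _ in range(n - 1):
        e = 1.0 + beta * e
    return e

N = 1000                # hypothetical value; the text does not fix N
beta = (N - 1) / N      # beta = (N - 1) / N as in the text

for n in (1, 10, 100, 1000, 100_000):
    closed = effective_number(n, beta)
    stepwise = effective_number_recursive(n, beta)
    print(n, round(closed, 2), round(stepwise, 2))
```

For small $n$ the effective number stays close to $n$, because new samples rarely overlap earlier ones. As $n$ grows it saturates at $1/(1-\beta) = N$, which is why a class with very many samples gains little extra weight-reducing effect from each additional sample.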

Here, the ordinary (unweighted) loss function is taken to be the Softmax loss: given a sample labeled $y$, the Softmax cross-entropy (CE) loss of that sample is expressed as formula (5).

In summary, assuming that class $y$ has $n_y$ training samples, the class-balanced Softmax cross-entropy loss (CB_softmax) is as shown in equation (6).
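Formulas (5) and (6) are not reproduced in this text. Writing $z = (z_1, \dots, z_C)$ for the model outputs over $C$ classes (both symbols are assumptions, since the text does not name them), forms consistent with the description would be:

```latex
\mathrm{CE}_{\mathrm{softmax}}(z, y)
  = -\log\!\left(\frac{\exp(z_y)}{\sum_{j=1}^{C}\exp(z_j)}\right) \tag{5}

\mathrm{CB}_{\mathrm{softmax}}(z, y)
  = -\frac{1-\beta}{1-\beta^{n_y}}\,
    \log\!\left(\frac{\exp(z_y)}{\sum_{j=1}^{C}\exp(z_j)}\right) \tag{6}
```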

Furthermore, the concept of the effective number of samples introduced here allows a single hyperparameter $\beta$ to smoothly adjust the class balance term between no re-weighting and inverse class frequency weighting.
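A class-balanced softmax cross entropy of the kind described above can be sketched in a few lines. The implementation below assumes the weight $(1-\beta)/(1-\beta^{n_y})$ multiplies the ordinary softmax cross entropy; the function names, class counts and model outputs are hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """-log(exp(z_y) / sum_j exp(z_j)), computed with the max-shift trick for stability."""
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum - logits[label]

def class_balanced_loss(logits, label, class_counts, beta):
    """Softmax cross entropy scaled by (1 - beta) / (1 - beta ** n_y)."""
    n_y = class_counts[label]
    weight = (1.0 - beta) / (1.0 - beta ** n_y)
    return weight * softmax_cross_entropy(logits, label)

class_counts = [10_000, 50]   # hypothetical counts: index 0 = normal, index 1 = fault
logits = [1.2, -0.3]          # hypothetical model outputs for one sample

for beta in (0.0, 0.99, 0.9999):
    normal_loss = class_balanced_loss(logits, 0, class_counts, beta)
    fault_loss = class_balanced_loss(logits, 1, class_counts, beta)
    print(f"beta={beta}: normal={normal_loss:.6f} fault={fault_loss:.6f}")
```

At $\beta = 0$ the loss equals the ordinary cross entropy for both classes. Raising $\beta$ shrinks the loss of the frequent class much more than that of the rare class, which shifts the training signal toward fault samples.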

Referring to FIG. 5, in an embodiment of the present application, traversing the effective samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain the constraint parameters includes:

S51: traversing the samples by using the boosting tree algorithm and the negative gradient values of the loss function to obtain the minimum value of the loss function;

S52: performing residual fitting according to the minimum value of the loss function to obtain the constraint parameters.

Referring to FIG. 6, step S43 of adjusting the initial gradient boosting regression model according to the constraint parameters to construct the fault detection model may include:

S61: training the gradient boosting regression model of the next round according to the constraint parameters calculated by the gradient boosting regression model of each round;

S62: obtaining the fault detection model from the initial gradient boosting regression model after a preset number of training iterations.
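The round-by-round training described in S61 and S62 can be illustrated with a deliberately small gradient boosting loop. This is a sketch under several assumptions, not the method of the application: it uses single-feature decision stumps instead of LightGBM trees, a class-weighted logistic loss, and it reads the "constraint parameters" as the split threshold and leaf values fitted to the negative gradients. All names and data values are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def fit_stump(xs, residuals):
    """Traverse candidate thresholds; return the split with the least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2], best[3]

def predict_score(model, x):
    """Sum the contributions of all stumps for one input value."""
    score = 0.0
    for threshold, left_value, right_value in model["stumps"]:
        score += model["rate"] * (left_value if x <= threshold else right_value)
    return score

def train(xs, ys, class_weights, rounds=50, rate=0.5):
    """Each round fits a stump to the negative gradients left by the previous rounds."""
    model = {"rate": rate, "stumps": []}
    for _ in range(rounds):
        scores = [predict_score(model, x) for x in xs]
        residuals = [class_weights[y] * (y - sigmoid(s)) for s, y in zip(scores, ys)]
        model["stumps"].append(fit_stump(xs, residuals))
    return model

# Hypothetical data: one metric (CPU utilization), 20 normal samples and 2 faults.
xs = [0.20 + 0.02 * i for i in range(20)] + [0.93, 0.97]
ys = [0] * 20 + [1] * 2
beta = 0.9
counts = {0: 20, 1: 2}
class_weights = {y: (1 - beta) / (1 - beta ** n) for y, n in counts.items()}

model = train(xs, ys, class_weights)
print(sigmoid(predict_score(model, 0.95)), sigmoid(predict_score(model, 0.30)))
```

Each stump is fitted to what the previous rounds left unexplained, which is the sense in which the next round's model depends on the parameters computed in the current round. A production system would replace the stumps with a library such as LightGBM, as the description indicates.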

In the above embodiment, the loss function defined in the foregoing embodiment, that is, the constraint parameter β, is reused in the subsequent training of the GBDT model to overcome the problems caused by sample imbalance. Specific ways of obtaining the constraint parameters and of adjusting the gradient boosting regression model with them will be described in detail in the following embodiments and are not detailed here.

As an embodiment of the present application, the method further comprises: and acquiring and storing the operation and maintenance feedback information, and updating the fault detection model by using the operation and maintenance feedback information.

Operation and maintenance feedback information of the system is acquired and stored. Specifically, the feedback information may be information entered by operation and maintenance personnel, including events that the system detected incorrectly and events of new types. The feedback information is added to the IT system operation and maintenance information base, and a new round of model training is performed, so that the detection accuracy of the model improves continuously.
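One way to picture this feedback loop is a store that appends each piece of operator feedback to the information base and triggers retraining once enough has accumulated. The application does not say when retraining happens; the batch threshold, the class name and the callback below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sample = Tuple[Dict[str, float], int]   # (metrics, label) with 1 = fault, 0 = normal

@dataclass
class FeedbackLoop:
    """Stores operator feedback and triggers retraining once enough has accumulated."""
    retrain: Callable[[List[Sample]], None]
    retrain_after: int = 3                       # hypothetical batch size
    knowledge_base: List[Sample] = field(default_factory=list)
    pending: int = 0

    def add_feedback(self, metrics: Dict[str, float], corrected_label: int) -> bool:
        """Record one corrected or newly labelled event; return True if retraining ran."""
        self.knowledge_base.append((metrics, corrected_label))
        self.pending += 1
        if self.pending >= self.retrain_after:
            self.retrain(self.knowledge_base)
            self.pending = 0
            return True
        return False

# The retrain callback here only records the training-set size of each run.
training_runs: List[int] = []
loop = FeedbackLoop(retrain=lambda samples: training_runs.append(len(samples)))

loop.add_feedback({"cpu": 0.95}, 0)   # false alarm reported by an operator
loop.add_feedback({"cpu": 0.40}, 1)   # missed fault
loop.add_feedback({"cpu": 0.61}, 1)   # new kind of event, labelled as a fault
```

Batching the feedback avoids retraining on every single correction, which would be costly for a boosted-tree model trained on the full information base.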

In an embodiment of the present application, FIG. 7 shows a schematic structural diagram of a system applying the intelligent operation and maintenance method. The system overcomes defects of traditional operation and maintenance systems, such as the large amount of manual intervention needed for threshold setting and fault detection, and slow response. It also uses a small-sample learning technique to address the imbalance between positive and negative samples in the IT operation and maintenance scenario. The resulting intelligent fault detection system, based on this small-sample learning technique, performs better than fault detection systems trained with traditional machine learning and deep learning algorithms; it is referred to below as the AIOPS intelligent operation and maintenance system.

AIOPS, namely Artificial Intelligence for IT Operations, applies artificial intelligence to the operation and maintenance field: based on existing operation and maintenance data (logs, monitoring information, application information and the like), it uses machine learning, deep learning and similar approaches to solve problems that traditional operation and maintenance methods cannot. AIOPS does not rely on manually specified rules; instead, it advocates that an artificial intelligence algorithm automatically and continuously learn, refine and summarize rules from massive operation and maintenance data (including events, various information data, and the manual processing logs of operation and maintenance personnel).

In general, the fault detection function in the AIOPS scenario is a classification task. In the data set of a classification task, if the numbers of training examples from different classes differ greatly, the data set is considered class-imbalanced. The imbalance ratio varies from task to task, ranging from less than ten to several thousand. Classification tasks based on such data are collectively referred to as sample-imbalanced classification.

The AIOPS intelligent operation and maintenance system obtains system-level, application-level and alarm data generated during operation of an IT system, including system-level information (CPU utilization, memory utilization, network speed and the like) and application-level information (transaction success rate, transaction response time and the like), together with the large accumulated volume of original alarm information and the corresponding manual alarm labels. Intelligent fault early warning/detection is then performed by a fault early warning/detection module and an alarm information processing module obtained through deep learning training, so that the system can detect anomalies and raise alarms while the impact of a fault is still small, or even before the fault occurs, and operation and maintenance work proceeds more smoothly. A small-sample learning technique is used during model training, so that the model detects abnormal samples better even when they make up an extremely small proportion of the data. Meanwhile, the system continuously collects the error reports, new types of events and the like fed back by operation and maintenance personnel during operation and maintenance, and uses them to iteratively retrain and update its detection model.

In this embodiment, the system shown in FIG. 7 achieves intelligent fault early warning/detection on IT operation and maintenance information by acquiring various information data and applying the corresponding intelligent processing modules, and saves a large amount of manpower and material resources compared with a conventional monitoring platform. A small-sample learning technique is used during model training, so that the model of the present application detects abnormal samples that make up a very small proportion of the data better than a traditional anomaly detection model does. Subsequent operation and maintenance processing is then carried out according to the corresponding early warning or alarm. In addition, the system collects the error reports, new types of events and the like fed back by operation and maintenance personnel, and iteratively retrains and updates its detection model, so that the accuracy of early warnings and alarms improves continuously.

Operation and maintenance information acquisition and preprocessing unit: this unit collects various information data generated during operation of the IT system, preprocesses the data, and transmits the data to the intelligent fault early warning/detection unit. It is responsible for collecting the service success rate, system success rate, CPU utilization, memory utilization, service response time, service time consumption, network rate, monitoring messages, accumulated original alarm information, the corresponding manual alarm labels, and the like; it can also perform data preprocessing on this information, such as noise filtering, data cleaning, data correlation analysis, and dimensionality reduction by principal component analysis (PCA).

Intelligent fault early warning/detection unit: this unit receives the data transmitted by the operation and maintenance information acquisition and preprocessing unit, processes the data with an intelligent fault early warning module, an intelligent fault detection module and an intelligent alarm de-duplication module, and transmits the final fault information to the subsequent operation and maintenance processing unit. A small-sample learning technique is used during model training, so that the model detects abnormal samples better even when they make up an extremely small proportion of the data; this enables early warning and real-time detection of faults and avoids problems such as the difficulty of manually setting fixed monitoring thresholds. The intelligent fault early warning/detection module is trained using a LightGBM classifier.

Subsequent operation and maintenance processing unit: after receiving the fault information sent by the intelligent fault early warning/detection unit, this unit performs the subsequent operation and maintenance operations indicated by the early warning/alarm information, including alarm display, SMS notification, information recording, load balancing, network rate limiting, service rate limiting, dynamic resource scaling, primary/standby switchover, and collection of feedback from operation and maintenance personnel.
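The description of the units above mentions alarm de-duplication without saying how it works; in the application it is a trained model. As a simple rule-based stand-in, offered only to illustrate what de-duplication means here, repeated alarms with the same source and message can be suppressed within a time window. The window length, the alarm tuple layout and the host names are hypothetical.

```python
from typing import Dict, List, Tuple

Alarm = Tuple[float, str, str]   # (timestamp in seconds, source, message)

def deduplicate(alarms: List[Alarm], window: float = 60.0) -> List[Alarm]:
    """Keep an alarm only if no identical alarm was kept within `window` seconds before it."""
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alarm] = []
    for timestamp, source, message in sorted(alarms):
        key = (source, message)
        previous = last_kept.get(key)
        if previous is None or timestamp - previous >= window:
            kept.append((timestamp, source, message))
            last_kept[key] = timestamp
    return kept

alarms = [
    (0.0, "db-01", "connection pool exhausted"),
    (5.0, "db-01", "connection pool exhausted"),    # repeat within the window, suppressed
    (12.0, "web-03", "response time above limit"),
    (70.0, "db-01", "connection pool exhausted"),   # outside the window, kept again
]
unique = deduplicate(alarms)
print(unique)
```

A learned de-duplication model could additionally group alarms whose messages differ in wording but share a root cause, which this exact-match rule cannot do.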

In this embodiment, Fig. 8 shows the process of operation and maintenance information acquisition, model training, intelligent early warning/alarm by the model, and subsequent operations in the AIOPS-based intelligent operation and maintenance system.

Step one: operation and maintenance information acquisition. Acquire the various kinds of operation and maintenance data accumulated during the historical operation of the IT system, including: service success rate, system success rate, CPU utilization, memory utilization, service response time, service time consumption, network rate, monitoring messages, original alarm information, operation and feedback information from operation and maintenance personnel, and the corresponding labeling information. Preprocess this information by noise filtering, data cleaning, data correlation analysis, and principal component analysis (PCA) dimension reduction. Then build an IT system operation and maintenance information base from the original data and the preprocessed, labeled data.
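Some of the preprocessing named in step one can be illustrated with a small sketch. It is a minimal example under stated assumptions, not the applicant's implementation: the function names `clean`, `median_filter`, and `pearson`, the window size, and the sample records are all hypothetical, and PCA is left out for brevity.

```python
from statistics import mean, median

def clean(rows):
    """Data cleaning: drop records that have a missing (None) metric."""
    return [r for r in rows if all(v is not None for v in r)]

def median_filter(series, window=3):
    """Noise filtering: replace each point by the median of its neighbourhood."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(median(series[lo:hi]))
    return out

def pearson(xs, ys):
    """Correlation analysis: Pearson correlation coefficient of two metrics."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical records: (CPU utilization in %, service response time in ms).
raw = [(20, 100), (22, 110), (None, 105), (95, 118), (24, 120), (26, 131), (28, 140)]
rows = clean(raw)                            # the record with None is dropped
cpu = median_filter([r[0] for r in rows])    # the isolated spike of 95 is smoothed away
rt = median_filter([r[1] for r in rows])
correlation = pearson(cpu, rt)
```

A strongly correlated pair of metrics, as in this toy data, is the kind of redundancy that a later dimension reduction step would compress.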

Step two: train the intelligent fault early warning/detection model. The IT system operation and maintenance information base obtained in step one is used to train an intelligent fault early warning and detection discrimination model and an intelligent alarm information deduplication model, so that alarm thresholds no longer need to be set manually. Because faults occur far less often than normal operation in the AIOPS scenario, the numbers of positive and negative samples in the training data set differ greatly; a learning technique designed for sample imbalance is therefore adopted to improve the model's detection of the data from the few fault scenarios. Intelligent fault early warning/detection is implemented using the LightGBM classifier.
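The balancing formula is not given at this point in the text. One common way to obtain a class balance term that sits between no re-weighting and inverse class frequency weighting is the "effective number of samples" weighting; the sketch below assumes that formulation. The parameter `beta`, the function names, and the sample counts are illustrative assumptions.

```python
def effective_number(n, beta):
    """Effective number of samples E_n = (1 - beta**n) / (1 - beta), with 0 <= beta < 1."""
    return (1.0 - beta ** n) / (1.0 - beta)

def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights proportional to 1 / E_n, scaled to sum to the number of classes."""
    raw = {c: 1.0 / effective_number(n, beta) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Hypothetical class counts: faults are rare compared with normal operation.
counts = {"normal": 9900, "fault": 100}
weights = class_balanced_weights(counts, beta=0.999)

# beta = 0 gives equal weights (no re-weighting); beta close to 1 approaches
# inverse class frequency weighting.
unweighted = class_balanced_weights(counts, beta=0.0)
near_inverse = class_balanced_weights(counts, beta=0.999999)
```

Weights of this kind could be supplied to a classifier as per-sample weights, for example through the `sample_weight` argument of `LGBMClassifier.fit`; whether the patented method does so is not stated in the text.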

The GBDT algorithm is described as follows. Boosting is a common ensemble learning method: in each iteration it compares the errors of the current results, adjusts the sample weights, and trains a new learner, and the final result is a weighted sum of the models. GBDT combines multiple decision trees using the Boosting idea and minimizes the loss function by gradient descent to form a strong learner, which is why it is called a gradient boosting tree.

Further, GBDT (Gradient Boosting Decision Tree) is a boosting tree algorithm. It is called a gradient boosting tree because each new round's model is built on the model from the previous round, in the direction of the negative gradient, along which the loss function decreases fastest. GBDT can be regarded as a strong learner composed of several weak learners. Suppose the ensemble of weak learners obtained after the first $t-1$ rounds is $f_{t-1}(x)$ and the loss function is $L(y, f_{t-1}(x))$, where $y$ is the true value of the sample and $x$ denotes the input fields of the sample. The goal is then to find a decision tree weak learner $h_t(x)$ that minimizes the loss function in round $t$, that is, to minimize $L(y, f_t(x)) = L(y, f_{t-1}(x) + h_t(x))$.

Specifically, the key to GBDT is how to keep the loss function moving toward its minimum. Friedman therefore proposed using the negative gradient of the loss function, the direction in which the loss decreases fastest, as an approximation of the residual of the boosting tree in the regression problem, and fitting the next round's decision tree to it. The negative gradient for the i-th sample in the t-th round is shown in formula (7).

Therefore, a complete gradient boosting regression model can be represented by the following flow:

1) Input the training samples $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ and initialize the regression decision tree $f_0(x)$, as shown in equation (8).

Here $c$ is the output value that minimizes the loss function, $N$ is the total number of input samples, and $y_i$ is the true value of the i-th sample.

2) For each sample $i$, take the negative gradient of the loss function under the current model as that sample's residual, and use the results to train a new regression tree.

3) All samples in the leaf nodes are traversed to determine the output value of the t-th tree that minimizes the loss function, as shown in equation (9).

Thereby obtaining a fitting function of the t-th round regression tree as shown in equation (10).

4) The strong learner expression after the t-th round is shown in equation (11).

Steps 2) to 4) are repeated to obtain the expression of the final GBDT strong learner.
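The equation images referenced as (7) to (11) are not reproduced in this text. For orientation, the standard gradient boosting formulation that is consistent with the surrounding definitions is given below. The leaf region $R_{tj}$ (the j-th leaf of the t-th tree), the number of leaves $J$, the leaf output $c_{tj}$, and the indicator $I(\cdot)$ are notation introduced here, and the mapping to the original equation numbers is an assumption.

```latex
\begin{align}
r_{ti} &= -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{t-1}} \tag{7} \\
f_0(x) &= \arg\min_{c} \sum_{i=1}^{N} L(y_i, c) \tag{8} \\
c_{tj} &= \arg\min_{c} \sum_{x_i \in R_{tj}} L\bigl(y_i, f_{t-1}(x_i) + c\bigr) \tag{9} \\
h_t(x) &= \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj}) \tag{10} \\
f_t(x) &= f_{t-1}(x) + h_t(x) \tag{11}
\end{align}
```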

Furthermore, GBDT fits the residuals iteratively, moving in the direction of the negative gradient in each round, so that the predicted value converges toward the true value. As a Boosting method, GBDT can flexibly handle various types of data and achieves high prediction accuracy.
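The iterative residual fitting described above can be sketched in a few lines. This is a simplified illustration, not the patented model: it assumes a squared loss, a single input feature, and one-split regression trees (stumps), and the names `fit_stump` and `fit_gbdt` and the sample data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - c_left) ** 2 for r in left)
               + sum((r - c_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, c_left, c_right)
    _, threshold, c_left, c_right = best
    return lambda x: c_left if x <= threshold else c_right

def fit_gbdt(xs, ys, rounds=20):
    """Gradient boosting with squared loss L = (y - f)^2 / 2."""
    f0 = sum(ys) / len(ys)          # step 1: the constant that minimises the loss
    trees = []
    preds = [f0] * len(xs)
    for _ in range(rounds):
        # step 2: for squared loss the negative gradient is the residual y - f
        residuals = [y - p for y, p in zip(ys, preds)]
        # step 3: fit a tree to the residuals; leaf means minimise the loss
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # step 4: add the new tree to the strong learner
        preds = [p + tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + sum(t(x) for t in trees)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.9, 5.0]
model = fit_gbdt(xs, ys)
```

Production libraries such as LightGBM add many refinements (deeper trees, shrinkage, histogram-based splits) that this sketch omits.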

Step three: intelligent fault early warning/detection. During actual operation, the system acquires real-time operation and maintenance information, obtains the corresponding early warning/fault information by running it through the trained intelligent fault early warning/detection model, and deduplicates the related alarm information.
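The text describes a trained deduplication model but gives no details. As a stand-in, the sketch below uses a simple rule instead of a model: an alarm is suppressed if an alarm with the same source and kind was already kept within a time window. The field names, the window length, and the sample alarms are assumptions.

```python
def deduplicate(alarms, window_seconds=300):
    """Keep an alarm only if no alarm with the same (source, kind) key
    was kept within the preceding window_seconds."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        key = (alarm["source"], alarm["kind"])
        previous = last_kept.get(key)
        if previous is None or alarm["time"] - previous >= window_seconds:
            kept.append(alarm)
            last_kept[key] = alarm["time"]
    return kept

# Hypothetical alarms; "time" is in seconds.
alarms = [
    {"time": 0, "source": "app-01", "kind": "high_cpu"},
    {"time": 30, "source": "app-01", "kind": "high_cpu"},
    {"time": 60, "source": "db-01", "kind": "slow_response"},
    {"time": 400, "source": "app-01", "kind": "high_cpu"},
]
unique = deduplicate(alarms)
```

Here the repeat at 30 seconds is dropped, while the alarm at 400 seconds is kept because it falls outside the window.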

Step four: subsequent operation and maintenance processing. According to the early warning/fault information passed on from step three, the system or the operation and maintenance personnel perform the corresponding operation and maintenance operations, including alarm information display, short message sending, information recording, load balancing, network rate limiting, service rate limiting, dynamic resource capacity expansion, active/standby switchover, and feedback to operation and maintenance personnel.
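One way to picture step four is as a lookup from a detected fault type to a list of operations. The sketch below is purely illustrative: the fault types, the mapping in `STRATEGIES`, and the function names are hypothetical and are not taken from the application.

```python
# Hypothetical mapping from fault type to operation and maintenance operations.
STRATEGIES = {
    "high_cpu": ["display_alarm", "load_balancing", "dynamic_capacity_expansion"],
    "traffic_surge": ["display_alarm", "network_rate_limiting", "service_rate_limiting"],
    "node_down": ["display_alarm", "send_sms", "active_standby_switchover"],
}
DEFAULT_STRATEGY = ["display_alarm", "record_information", "notify_operators"]

def select_strategy(fault_type):
    """Return the operations for a fault type, falling back to a default."""
    return STRATEGIES.get(fault_type, DEFAULT_STRATEGY)

def handle(fault_type, actions):
    """Run each operation in the selected strategy; `actions` maps names to callables."""
    performed = []
    for name in select_strategy(fault_type):
        actions[name](fault_type)
        performed.append(name)
    return performed

# Stand-in actions that only record what would have been done.
log = []
all_names = {n for names in [*STRATEGIES.values(), DEFAULT_STRATEGY] for n in names}
actions = {n: (lambda fault, n=n: log.append((n, fault))) for n in all_names}
performed = handle("node_down", actions)
```

Unknown fault types fall through to the default strategy, which matches the idea of handing unrecognized events to operation and maintenance personnel.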

Step five: improve the IT system operation and maintenance information base and iteratively update the model. Operation and maintenance personnel continuously supplement feedback information, including events the system detected incorrectly and new types of events; this feedback is added to the IT system operation and maintenance information base, and a new round of model training is then carried out, so that the detection accuracy of the model keeps improving.

Therefore, the system can realize early warning and intelligent detection of system faults and realize the function of model iterative updating.

The AIOPS intelligent operation and maintenance system acquires system-level, application-level, and alarm data generated while the IT system is running, where the system-level information includes CPU utilization, memory utilization, network speed, and the like, and the application-level information includes transaction success rate, transaction response time, and the like; it also uses the huge amount of accumulated original alarm information and the corresponding manual alarm labels. Intelligent fault early warning/detection is then performed by the fault early warning/detection module and the alarm information processing module obtained through deep learning training, so that the system can detect anomalies and raise an alarm while the impact of a fault is still small, or even before the fault occurs, allowing operation and maintenance work to proceed more smoothly. An innovative sample balance loss technique is used during model training so that the model detects abnormal samples well even though they make up an extremely small proportion of the data. Meanwhile, the AIOPS intelligent operation and maintenance system can continuously collect the error information, new types of events, and the like fed back by operation and maintenance personnel during operation and maintenance, and iteratively train and update its detection model.

Fig. 9 is a schematic structural diagram of an intelligent operation and maintenance device according to an embodiment of the present application, where the device includes:

the acquisition module 10 is used for acquiring real-time operation data acquired by the operation and maintenance system;

the analysis module 20 is configured to analyze the real-time operation data through a fault detection model constructed based on a sample balance loss technology to obtain a fault detection result;

and the processing module 30 is configured to obtain an operation and maintenance policy according to the fault detection result, and perform corresponding operation and maintenance processing according to the operation and maintenance policy.
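The three modules can be pictured as a small pipeline. The following is a structural sketch only: the class and field names are hypothetical, and a fixed reading and a simple rule stand in for the real data source and the trained fault detection model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]

@dataclass
class AcquisitionModule:
    """Module 10: acquires real-time operation data."""
    source: Callable[[], Metrics]

    def acquire(self) -> Metrics:
        return self.source()

@dataclass
class AnalysisModule:
    """Module 20: applies the fault detection model to the data."""
    model: Callable[[Metrics], str]

    def analyze(self, data: Metrics) -> str:
        return self.model(data)

@dataclass
class ProcessingModule:
    """Module 30: maps a detection result to an operation and maintenance strategy."""
    strategies: Dict[str, List[str]]

    def process(self, result: str) -> List[str]:
        return self.strategies.get(result, [])

def run_once(acq: AcquisitionModule, ana: AnalysisModule, proc: ProcessingModule) -> List[str]:
    """Acquire data, analyze it, and return the operations to perform."""
    return proc.process(ana.analyze(acq.acquire()))

# Hypothetical wiring with stand-ins for the data source and the trained model.
acquisition = AcquisitionModule(source=lambda: {"cpu_utilization": 0.97})
analysis = AnalysisModule(model=lambda m: "fault" if m["cpu_utilization"] > 0.9 else "normal")
processing = ProcessingModule(strategies={"fault": ["display_alarm", "dynamic_capacity_expansion"]})
operations = run_once(acquisition, analysis, processing)
```

Keeping the model behind a plain callable means a retrained model can be swapped in without touching the other two modules.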

The present application also provides a computer-readable storage medium storing a computer program for executing the above method.

As shown in fig. 10, the electronic device 600 may further include: communication module 110, input unit 120, audio processor 130, display 160, power supply 170. It is noted that the electronic device 600 does not necessarily include all of the components shown in fig. 10; furthermore, the electronic device 600 may also comprise components not shown in fig. 10, for which reference may be made to the prior art.

As shown in fig. 10, the central processor 100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processor 100 receiving input and controlling the operation of the various components of the electronic device 600.

The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable device. The memory 140 may store failure-related information and may also store a program for processing that information. The central processor 100 may execute the program stored in the memory 140 to realize information storage or processing, etc.

The input unit 120 provides an input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used to display an object to be displayed, such as an image or a character. The display may be, for example, an LCD display, but is not limited thereto.

The memory 140 may be a solid state memory such as Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. It may also be a memory that retains information when powered off, can be selectively erased, and can be provided with further data; an example of such a memory is sometimes called an EPROM or the like. The memory 140 may also be some other type of device. Memory 140 includes buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage section 142, and the application/function storage section 142 is used to store application programs and function programs or a flow for executing the operation of the electronic device 600 by the central processor 100.

The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage portion 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging application, address book application, etc.).

The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.

Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132 to implement general telecommunications functions. Audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 130 is also coupled to the central processor 100, so that sound can be recorded locally through the microphone 132 and locally stored sound can be played through the speaker 131.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The principle and the implementation mode of the present application are explained by applying specific embodiments in the present application, and the description of the above embodiments is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. An intelligent operation and maintenance method, characterized in that the method comprises:

acquiring real-time operation data acquired by an operation and maintenance system;

analyzing the real-time operation data through a fault detection model constructed based on a sample balance loss technology to obtain a fault detection result;

and obtaining an operation and maintenance strategy according to the fault detection result, and performing corresponding operation and maintenance processing according to the operation and maintenance strategy.

2. The intelligent operation and maintenance method according to claim 1, further comprising:

obtaining historical system operation data, and preprocessing the historical system operation data to generate training sample data;

sampling the training sample data through a preset loss function to obtain an effective sample with balanced positive and negative samples;

and constructing the fault detection model through a tree type algorithm according to the loss function and the effective samples.

3. The intelligent operation and maintenance method according to claim 2, wherein the obtaining of the effective sample with balanced positive and negative samples by sampling the training sample data through a preset loss function comprises:

defining, through a preset loss function, a class balance term that lies between no re-weighting and inverse class frequency re-weighting;

and sampling the training sample data according to the class balance item to obtain an effective sample with balanced positive and negative samples.

4. The intelligent operation and maintenance method according to claim 2, wherein constructing the fault detection model through a tree type algorithm according to the loss function and the effective samples comprises:

determining the number of effective samples corresponding to different types of samples according to the number of samples in the effective samples;

updating parameters of a preset loss function according to the number of effective samples corresponding to different types of samples;

and constructing and obtaining the fault detection model through a tree type algorithm according to the effective samples in the training sample data and the loss function after the parameters are updated.

5. The intelligent operation and maintenance method according to claim 4, wherein the step of constructing and obtaining the fault detection model through a tree type algorithm according to the effective samples in the training sample data and the loss function after parameter update comprises:

determining a loss function negative gradient value corresponding to each effective sample by using a boosting tree algorithm and the loss function after parameter updating;

performing sample traversal on the effective samples by using the boosting tree algorithm and the loss function negative gradient values to obtain a constraint parameter;

and adjusting an initial gradient boosting algorithm regression model according to the constraint parameters to construct the fault detection model.

6. The intelligent operation and maintenance method according to claim 5, wherein performing sample traversal on the effective samples by using a boosting tree algorithm and the negative gradient value of the loss function to obtain constraint parameters comprises:

performing sample traversal by using the boosting tree algorithm and the negative gradient value of the loss function to obtain the minimum value of the loss function;

and performing residual error fitting according to the minimum value of the loss function to obtain constraint parameters.

7. The intelligent operation and maintenance method according to claim 5, wherein the adjusting an initial gradient boosting algorithm regression model according to the constraint parameters to construct the fault detection model comprises:

training the gradient boosting algorithm regression model of the next round according to the constraint parameters calculated by the gradient boosting algorithm regression model of each round;

and obtaining the fault detection model by training the initial gradient boosting algorithm regression model for a preset number of iterations.

8. An intelligent operation and maintenance device, the device comprising:

the acquisition module is used for acquiring real-time operation data acquired by the operation and maintenance system;

the analysis module is used for analyzing the real-time operation data through a fault detection model constructed based on a sample balance loss technology to obtain a fault detection result;

and the processing module is used for obtaining an operation and maintenance strategy according to the fault detection result and carrying out corresponding operation and maintenance processing according to the operation and maintenance strategy.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 7.