WO2024001254A1

WO2024001254A1 - Server anomaly detection method and apparatus, device, and readable storage medium

Info

Publication number: WO2024001254A1
Application number: PCT/CN2023/078528
Authority: WO
Inventors: 邹德强; 满宏涛
Original assignee: 苏州元脑智能科技有限公司
Priority date: 2022-06-28
Filing date: 2023-02-27
Publication date: 2024-01-04
Also published as: CN114826971A; CN114826971B

Abstract

The present application discloses a server anomaly detection method, comprising: performing feature extraction on pieces of received server system data; constructing binary trees according to the pieces of extracted feature data; calculating the average path length respectively corresponding to each piece of server system data in the binary tree group obtained by construction; when the presence of anomalous data is detected in the pieces of server system data according to the average path lengths, acquiring normal data and anomalous data obtained by remotely shunting the pieces of server system data; establishing a first multivariate Gaussian distribution model on the basis of the pieces of normal data, and establishing a second multivariate Gaussian distribution model on the basis of the pieces of anomalous data; and performing overlay anomaly detection on the pieces of server system data in combination with the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model. The present application improves detection efficiency and effectively mitigates the drawbacks generally associated with high computational loads such as distance-based anomaly detection. The present application further discloses an apparatus, a device, and a storage medium which have corresponding technical effects.

Description

A server anomaly detection method, device, equipment and readable storage medium

Cross-references to related applications

This application claims priority to the Chinese patent application filed with the China Patent Office on June 28, 2022, with application number 202210738323.5 and entitled "A server anomaly detection method, device, equipment and readable storage medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the technical fields of artificial intelligence and anomaly detection, and in particular to a server anomaly detection method, device, equipment and non-volatile readable storage medium.

Background technique

Anomaly detection is the detection of data in a data set that does not conform to expected patterns, that is, outliers, inconsistencies, and special points. It is suitable for system health detection, sensor network event detection, fault detection, and similar tasks, and helps ensure the normal operation of the system ecosystem. Anomaly detection is one of the applications of machine learning. In summary, its algorithms are based on principles such as probability statistics, nearest neighbors, and clustering. There are many classic algorithms and derivative algorithms, which can be divided into supervised learning, unsupervised learning, semi-supervised learning, and so on.

BMC (Baseboard Management Controller) is the "big housekeeper" of the entire server system and has a series of monitoring and control functions. It uses sensors to monitor system component temperature, humidity, voltage, fans, power supply, communication parameters, operating system functions, and so on, and makes appropriate adjustments to keep the system in a healthy state. BMC offers a wealth of solutions. Through joint in-band and out-of-band monitoring of the server, the status information of any system can be retrieved, such as CPU (Central Processing Unit) load, memory usage, network traffic, and the number of disk sector channels.

Currently, BMC generally uses thresholds as judgment conditions when monitoring server systems: when the temperature exceeds the threshold, fans are used to lower the temperature and keep the system in a healthy state. However, this reflex-like response lags slightly, and high-temperature damage to components is irreversible and shortens component life. When a major system risk occurs in the server, fan cooling has little effect, resulting in standby, crashes, and other adverse consequences. If reasonable responses and adjustments are not made, files may be lost, causing significant economic losses and creating hidden dangers for production safety. In pre-research BMC solutions, traditional machine-learning-based anomaly detection, especially distance-based detection, is prone to an explosion in the amount of computation.

Contents of the invention

The purpose of this application is to provide a server anomaly detection method that, through dual-end collaborative anomaly detection, allocates computing resources rationally, prevents an explosion in the amount of computation, improves detection efficiency, and avoids the drawbacks of computation-heavy approaches such as distance-based anomaly detection. Another purpose of this application is to provide a server anomaly detection device, equipment and non-volatile readable storage medium.

In order to solve the above technical problems, this application provides the following technical solutions:

A server anomaly detection method, including:

Receive system data from each server;

Perform feature extraction on each server system data to obtain each feature data;

Construct a binary tree based on each feature data to obtain each binary tree;

Calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree;

When abnormal data is detected in the data of each server system based on the average path length, obtain the normal data and abnormal data obtained by the remote end's offloading of the data of each server system;

Establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data;

The first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model are combined to perform superimposed anomaly detection on each server system data.

In some embodiments, the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model are combined to perform superimposed anomaly detection on each server system data, including:

Use the first multivariate Gaussian distribution model to calculate the normal probability corresponding to each server system data, and use the second multivariate Gaussian distribution model to calculate the abnormal probability corresponding to each server system data;

Obtain the preset normal probability threshold and abnormal probability threshold, and perform superimposed abnormality detection based on the normal probability threshold, abnormal probability threshold and the normal probability and abnormal probability corresponding to the server system data for each server system data.

In some embodiments, when abnormal data is detected in each server system data based on each average path length, it also includes:

Obtain the first abnormality detection result;

The first abnormality detection result is fed back to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system component.

In some embodiments, after performing superimposed anomaly detection for each server system data in combination with the normal probability threshold, the abnormal probability threshold, and the normal probability and abnormal probability corresponding to the server system data, it also includes:

Obtain the second anomaly detection result obtained by superimposed anomaly detection;

The server abnormality maintenance operation is performed based on the first abnormality detection result and the second abnormality detection result.

In some embodiments, performing server abnormality maintenance operations in combination with the first anomaly detection result and the second anomaly detection result includes:

When the first abnormality detection result is that abnormal data exists, and the second abnormality detection result is that there is server system data whose normal probability is not within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, a disk archiving instruction is sent to the baseboard management controller, so that the baseboard management controller performs a disk seal operation and sends an abnormality detection report to the superior;

When the first abnormality detection result is that abnormal data exists, and the second abnormality detection result is that there is no server system data whose abnormal probability is within the abnormal probability threshold, a fan control instruction is sent to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system components;

When the first abnormality detection result is that abnormal data exists, and the second abnormality detection result is that there is server system data whose normal probability is within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, a fan control instruction is sent to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system components.
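As a non-limiting illustration, the three cases above can be sketched as a small decision function. The names `Action` and `choose_action` are hypothetical, and each server system data item is represented by two booleans stating whether its normal probability and its abnormal probability fall within their respective thresholds. Two points are assumptions not stated in the text: no action is taken when the first result reports no abnormal data, and the disk archiving case takes precedence when several cases occur in the same batch.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAN_CONTROL = "fan_control"
    DISK_ARCHIVE_AND_REPORT = "disk_archive_and_report"


def choose_action(first_abnormal, flags):
    """Pick a maintenance action from the two detection results.

    first_abnormal: True if the near-end (first) result found abnormal data.
    flags: iterable of (normal_within, abnormal_within) booleans, one pair
           per server system data item, taken from the second result.
    """
    if not first_abnormal:
        # Assumption: remote detection is not triggered, so nothing is sent.
        return Action.NONE
    flags = list(flags)
    # Case 1: normal probability outside its threshold AND abnormal
    # probability inside its threshold -> seal the disk and report.
    if any(abnormal_within and not normal_within
           for normal_within, abnormal_within in flags):
        return Action.DISK_ARCHIVE_AND_REPORT
    # Case 2 (no item with abnormal probability inside the threshold) and
    # case 3 (both probabilities inside their thresholds) -> fan control.
    return Action.FAN_CONTROL
```

For example, `choose_action(True, [(True, True)])` corresponds to the third case and yields the fan control action.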

In some embodiments, binary tree construction is performed based on each feature data, including:

Each distributed computing structural unit in the baseboard management controller is used to construct a preset number of binary trees in parallel based on each characteristic data.

In some embodiments, when abnormal data is detected in each server system data based on each average path length, each normal data and each abnormal data obtained by the remote end's offloading of each server system data are obtained, including:

Calculate the anomaly score of each server system data in the binary tree group based on each average path length;

When it is detected that abnormal data exists in each server system data according to each abnormality score, each normal data and each abnormal data obtained by remotely diverting each server system data are obtained.

In some embodiments, after receiving each server system data, it also includes:

Store each server system data in a temporary storage module with queue attributes;

Perform feature extraction on each server system data, including:

Obtain each server system data from the temporary storage module, and perform feature extraction on each server system data.

In some embodiments, after combining the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model to perform superimposed anomaly detection on each server system data, it also includes:

When there is abnormal data in each server system data, the abnormal data in the temporary storage module is eliminated.

In some embodiments, feature extraction is performed on each server system data, including:

Randomly select a preset number of server system data from each server system data;

Perform feature extraction on the selected server system data.

In some embodiments, calculating the average path length corresponding to each server system data in a binary tree group composed of binary trees includes:

In a binary tree group composed of binary trees, for each server system data, calculate the distance from the leaf node where the server system data is located in each binary tree to the root node, and obtain the path length of the server system data on each binary tree;

Calculate the average path length on each binary tree to obtain the average path length corresponding to the server system data.

When it is determined that there is an average path length smaller than the preset abnormal path length threshold, each normal data and each abnormal data obtained by the remote end diverting the data of each server system are obtained.

A server anomaly detection device, including:

Data receiving module, used to receive data from each server system;

The feature extraction module is used to extract features from each server system data and obtain each feature data;

The binary tree building module is used to construct binary trees based on each feature data to obtain each binary tree;

The path length calculation module is used to calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree;

The data acquisition module is used to obtain the normal data and the abnormal data obtained by remotely diverting the data of each server system when abnormal data is detected in the data of each server system based on the average path length;

The model building module is used to establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data;

The superposition anomaly detection module is used to perform superposition anomaly detection on each server system data by combining the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model.

A server anomaly detection device, including:

Memory, used to store computer programs;

The processor is used to implement the steps of the previous server anomaly detection method when executing the computer program.

A non-volatile readable storage medium. A computer program is stored on the non-volatile readable storage medium. When the computer program is executed by a processor, the steps of the previous server anomaly detection method are implemented.

The server anomaly detection method provided by this application receives each server system data; performs feature extraction on each server system data to obtain each feature data; constructs binary trees based on each feature data to obtain each binary tree; calculates the average path length corresponding to each server system data in the binary tree group composed of the binary trees; when abnormal data is detected in the server system data based on the average path lengths, obtains the normal data and the abnormal data obtained by the remote end dividing the server system data; establishes the first multivariate Gaussian distribution model based on the normal data, and establishes the second multivariate Gaussian distribution model based on the abnormal data; and combines the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model to perform superimposed anomaly detection on each server system data.

It can be seen from the above technical solution that, at the near end, feature extraction is performed on the received server system data, binary trees are constructed based on the extracted feature data, the average path length corresponding to each server system data in the binary tree group composed of the binary trees is calculated, and initial anomaly detection is performed on each server system data based on each average path length. When the remote end receives the server system data, it pre-divides the data into normal data and abnormal data. When the initial anomaly detection at the near end finds abnormal data, the normal data and the abnormal data divided by the remote end are obtained, and a multivariate Gaussian distribution model is established for each, so that superimposed anomaly detection is performed on the server system data at the remote end. Near-end anomaly detection has the characteristics of edge computing: it omits the data transmission process and responds faster. When the near end detects an abnormality in the server system data, it can protect the system components promptly, before or just as they begin to heat up, preventing high-temperature damage to the components while maintaining the optimal working state and efficient output of the system. The remote end uses the multivariate Gaussian distribution models to perform global anomaly detection; triggered by the near-end anomaly detection, it performs superimposed anomaly detection to predict major risks such as server standby and crashes, so that maintenance measures can be taken in advance. Through dual-end collaborative anomaly detection, computing resources can be allocated rationally, which prevents an explosion in the amount of computation, improves detection efficiency, and avoids the drawbacks of computation-heavy approaches such as distance-based anomaly detection.

Correspondingly, this application also provides server anomaly detection devices, equipment and non-volatile readable storage media corresponding to the above-mentioned server anomaly detection method, which have the above technical effects and will not be described again here.

Description of drawings

In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.

Figure 1 is an implementation flow chart of the server anomaly detection method in the embodiment of the present application;

Figure 2 is another implementation flow chart of the server anomaly detection method in the embodiment of the present application;

Figure 3 is a structural block diagram of a server anomaly detection device in an embodiment of the present application;

Figure 4 is a structural block diagram of a server anomaly detection device in an embodiment of the present application;

Figure 5 is a schematic structural diagram of a server anomaly detection device provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of an embodiment of a non-volatile readable storage medium provided by an embodiment of the present application.

Detailed description

In order to enable those skilled in the art to better understand the solution of the present application, the present application will be further described in detail below in conjunction with the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

Referring to Figure 1, Figure 1 is an implementation flow chart of a server anomaly detection method in an embodiment of the present application. The method may include the following steps:

S101: Receive system data from each server.

During the operation of the server, server system data corresponding to each system component will be generated, and the baseboard management controller receives each server system data.

S102: Perform feature extraction on each server system data to obtain each feature data.

After receiving each server system data, feature extraction is performed on each server system data to obtain each feature data. Characteristic data can include CPU temperature, voltage, memory usage, CPU load, network traffic, etc.

In a specific implementation of the present application, after step S101, the method may further include the following step:

Store each server system data in a temporary storage module with queue attributes.

Correspondingly, feature extraction of each server system data may include the following step:

Obtain each server system data from the temporary storage module, and perform feature extraction on each server system data.

The baseboard management controller includes a temporary storage module integrated in the chip. After receiving each server system data, the baseboard management controller can store it in the temporary storage module. The temporary storage module can be set as a storage unit with queue attributes, that is, data is first in, first out, and is used to temporarily store server system data. When the temporary storage module is saturated, data is stored in a sliding manner: one unit of data slides in from the left end, and one unit of data slides out from the right end. The newly slid-in unit of data is marked as the data point to be detected, $x^{*}$. There is a data collection phase at the start; once the temporary storage module is saturated, the edge-end (i.e., near-end) anomaly detection environment is ready. For example, assume that the server system generates status information, that is, one unit of data, every 15 minutes; the temporary storage module then slides in one unit of data every 15 minutes.
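As a non-limiting illustration, the sliding behaviour of the temporary storage module can be sketched with a bounded double-ended queue. The class name `TempStore`, its method names, and the capacity used in the usage note are hypothetical; the real module is integrated in the baseboard management controller chip, and this sketch only models the first-in, first-out sliding described above.

```python
from collections import deque


class TempStore:
    """Fixed-capacity FIFO buffer: new units enter on the left, old ones leave on the right."""

    def __init__(self, capacity):
        # A deque with maxlen discards from the opposite end when full.
        self.buf = deque(maxlen=capacity)

    @property
    def ready(self):
        """True once the buffer is saturated, i.e. near-end detection can start."""
        return len(self.buf) == self.buf.maxlen

    def slide_in(self, unit):
        """Insert the newest unit on the left; return the unit that slid out, if any."""
        slid_out = self.buf[-1] if self.ready else None
        self.buf.appendleft(unit)
        return slid_out

    @property
    def point_to_detect(self):
        """The newest unit, i.e. the data point to be detected."""
        return self.buf[0] if self.buf else None

    def discard_newest(self):
        """Remove the newest unit directly, e.g. when it is judged abnormal."""
        return self.buf.popleft()
```

With a capacity of 3, sliding in units 1, 2, 3 and then 4 returns 1 as the unit slid out and leaves 4 as the point to be detected.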

In a specific implementation of this application, feature extraction of each server system data may include the following steps:

Step 1: Randomly select a preset number of server system data from each server system data;

Step 2: Extract features from the selected server system data.

For convenience of description, the above two steps can be combined for explanation.

After each server system data is received, a preset number of server system data can also be randomly selected from all of the server system data, that is, only a part of the server system data is selected, and feature extraction is performed only on the selected server system data. Randomly selecting a part of the server system data for feature extraction, and selecting a part of the extracted features for binary tree construction, not only ensures the diversity of the server system data on each tree, but also reduces memory consumption and avoids the curse of dimensionality. Features can be selected randomly, which takes full advantage of the speed of random selection, or by kurtosis testing, which gives a better feature selection effect.
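As a non-limiting illustration, the two selection steps can be sketched as follows. The function names and the feature names `cpu_temp` and `mem_usage` are hypothetical. The text does not say how the kurtosis test picks features; this sketch assumes that features are ranked by sample kurtosis (fourth central moment divided by the squared variance) and that the highest-ranked ones are kept.

```python
import random


def subsample(records, n_t, rng):
    """Randomly pick a preset number n_t of server system data records."""
    return rng.sample(records, min(n_t, len(records)))


def kurtosis(values):
    """Sample kurtosis m4 / m2**2; returns 0.0 for a constant feature."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)


def select_features(records, k, rng=None, by_kurtosis=False):
    """Pick k feature names, either at random or by highest kurtosis (assumed rule)."""
    names = sorted(records[0])
    if not by_kurtosis:
        rng = rng or random.Random()
        return sorted(rng.sample(names, k))
    ranked = sorted(names,
                    key=lambda f: kurtosis([r[f] for r in records]),
                    reverse=True)
    return ranked[:k]


# Hypothetical records: cpu_temp has one extreme reading, mem_usage is evenly spread.
records = [{"cpu_temp": 0.0 if i < 9 else 10.0, "mem_usage": float(i)}
           for i in range(10)]
picked = select_features(records, 1, by_kurtosis=True)
```

Here the feature with a single extreme reading has the heavier tail, so it is the one kept by the kurtosis ranking.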

S103: Construct a binary tree based on each feature data to obtain each binary tree.

After each feature data is extracted from each server system data, a binary tree is constructed based on each feature data. For example, the bagging method can be used to construct a binary tree to obtain each binary tree.

When building a binary tree, the selected server system data is put into the root node, a feature is randomly selected from the pre-selected feature data, and a cutting point c is randomly generated for the current feature, between the minimum value and the maximum value of that feature. The cutting point is used to generate a hyperplane that divides the server system data space into two subspaces: server system data whose value for this feature is less than c is placed in the left subtree, and server system data whose value for this feature is greater than or equal to c is placed in the right subtree. Each subtree recursively divides its server system data in the same way, continuously constructing new subtrees until a termination condition is met.

Termination conditions can include:

(1) The point to be detected has been segmented;

(2) The subtree has reached the height limit $l = \lceil \log_2 n_t \rceil$, where $n_t$ is the total number of pre-selected server system data;

(3) All characteristic values of the server system data on the subtree are the same;

(4) The subtree cannot be further divided.
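As a non-limiting illustration, the recursive construction and the four termination conditions can be sketched as follows. The function names, the dictionary node layout, and the sample values are hypothetical. Each server system data item is assumed to be a tuple of numeric feature values, and a branch that does not contain the point to be detected is assumed to stop immediately.

```python
import math
import random


def build_tree(points, features, rng, target=None, depth=0, limit=None):
    """Recursively build one binary tree over `points` (tuples of floats)."""
    if limit is None:
        # Condition (2): height limit l = ceil(log2(n_t)).
        limit = math.ceil(math.log2(max(len(points), 2)))
    # Condition (1): the point to be detected is isolated, or this branch
    # does not contain it at all (assumed early stop).
    target_done = target is not None and (len(points) == 1 or target not in points)
    if len(points) <= 1 or depth >= limit or target_done:
        return {"size": len(points)}
    # Conditions (3) and (4): only features whose values differ can be cut.
    splittable = [f for f in features
                  if min(p[f] for p in points) < max(p[f] for p in points)]
    if not splittable:
        return {"size": len(points)}
    f = rng.choice(splittable)
    lo = min(p[f] for p in points)
    hi = max(p[f] for p in points)
    cut = rng.uniform(lo, hi)  # cutting point c between min and max
    left = [p for p in points if p[f] < cut]
    right = [p for p in points if p[f] >= cut]
    return {
        "feature": f, "cut": cut,
        "left": build_tree(left, features, rng, target, depth + 1, limit),
        "right": build_tree(right, features, rng, target, depth + 1, limit),
    }


def leaf_sizes(node):
    """Number of points held in each leaf, left to right."""
    if "feature" not in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])


def height(node):
    """Longest root-to-leaf distance in edges."""
    if "feature" not in node:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))


# Seven ordinary readings plus one extreme reading as the point to be detected.
points = [(40.0 + i, 12.0, 0.30 + 0.01 * i) for i in range(7)] + [(95.0, 12.0, 0.99)]
tree = build_tree(points, features=[0, 1, 2], rng=random.Random(0), target=points[-1])
```

The second feature is constant in this sample, so it is never chosen for a cut, and with eight points the tree never grows beyond height 3.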

S104: Calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree.

After each binary tree is constructed, the average path length corresponding to each server system data in the binary tree group composed of each binary tree is calculated.

In a specific implementation of the present application, step S104 may include the following steps:

Step 1: In the binary tree group composed of each binary tree, for each server system data, calculate the distance from the leaf node where the server system data is located in each binary tree to the root node, and obtain the path length of the server system data on each binary tree;

Step 2: Calculate the average path length on each binary tree to obtain the average path length corresponding to the server system data.

When calculating the average path length corresponding to each server system data, first calculate, for each server system data, the distance from the leaf node where it is located to the root node in each binary tree, obtaining the path length h(x) of the server system data on each binary tree. Then average the path lengths h(x) over the binary trees to obtain the average path length E[h(x)] corresponding to the server system data.
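As a non-limiting illustration, the two steps can be sketched on hand-built trees. The dictionary node layout (internal nodes with `feature`, `cut`, `left`, `right`; leaves with `size`) and the function names are hypothetical. The path length is counted as the number of edges between the root and the leaf that the data item falls into, with no further adjustment for leaves holding several items.

```python
def path_length(tree, x):
    """h(x): number of edges from the root to the leaf that x falls into."""
    depth = 0
    node = tree
    while "feature" in node:
        # Values below the cutting point go left, the rest go right.
        if x[node["feature"]] < node["cut"]:
            node = node["left"]
        else:
            node = node["right"]
        depth += 1
    return depth


def average_path_length(trees, x):
    """E[h(x)]: mean of h(x) over all binary trees in the group."""
    return sum(path_length(t, x) for t in trees) / len(trees)


# Two small hand-built trees used only for illustration.
t1 = {"feature": 0, "cut": 5.0,
      "left": {"size": 3}, "right": {"size": 1}}
t2 = {"feature": 1, "cut": 2.0,
      "left": {"feature": 0, "cut": 1.0,
               "left": {"size": 1}, "right": {"size": 2}},
      "right": {"size": 1}}

x = (9.0, 1.0)
avg = average_path_length([t1, t2], x)  # h = 1 on t1, h = 2 on t2
```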

S105: When abnormal data is detected in each server system data based on each average path length, obtain the normal data and abnormal data obtained by the remote end's offloading of each server system data.

After calculating the average path length corresponding to each server system data in the binary tree group composed of the binary trees, whether there is abnormal data in the server system data is determined based on the average path lengths. When the server system data is sent to the near end, the same server system data is also sent to the remote end (such as a cloud platform), and the remote end divides the server system data into normal data and abnormal data. When abnormal data is detected in the server system data based on the average path lengths, remote anomaly detection is triggered, and the normal data and the abnormal data obtained by the remote end dividing the server system data are obtained.

In a specific implementation of the present application, step S105 may include the following steps:

Step 1: Calculate the anomaly score of each server system data in the binary tree group based on the average path length;

Step 2: When it is detected that abnormal data exists in each server system data according to each abnormality score, obtain each normal data and each abnormal data obtained by shunting each server system data from the remote end.

After calculating the average path length corresponding to each server system data in the binary tree group composed of each binary tree, the anomaly score of each server system data in the binary tree group can be calculated based on each average path length. When it is detected that abnormal data exists in each server system data according to each abnormality score, each normal data and each abnormal data obtained by remotely diverting each server system data are obtained.

The anomaly score can be calculated based on the relationship between the anomaly score, the average path length, and the height of the binary tree. Given a data set of n samples, the height of the binary tree is:

$$w(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

where $H(i) = \ln(i) + 0.5772156649$ approximates the harmonic number.

The anomaly score $s(x, n)$ maps the degree of abnormality to the interval [0, 1].

A threshold δ is set on the score; δ and the abnormal path length threshold $h_a$ are in one-to-one correspondence through the score mapping. The server system data $x^{*}$ to be detected is determined to be abnormal if and only if $s(x^{*}, n) > \delta$.

Generally, when $s(x^{*}, n)$ tends to 1, the server system data $x^{*}$ to be detected is determined to be abnormal; when $s(x^{*}, n)$ tends to 0, it is determined to be normal.

Generally, the average path length E[h(x)] of abnormal data is short, because such data is easy to segment. An abnormal path length threshold $h_a$ can be set in advance. When it is determined that there is an average path length smaller than the preset abnormal path length threshold, for example when the server system data $x^{*}$ satisfies $E[h(x^{*})] \le h_a$, the sample $x^{*}$ is judged to be abnormal. In this case, the normal data and the abnormal data obtained by the remote end dividing the server system data are obtained.
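As a non-limiting illustration, the score and its correspondence with the path length threshold can be sketched as follows. The form s(x, n) = 2^(-E[h(x)] / w(n)) is an assumption: it is the form commonly used with isolation-style binary trees and is consistent with the behaviour described above (short paths give scores near 1, long paths give scores near 0). The function names and the values n = 256 and h_a = 4.0 are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """H(i) = ln(i) + 0.5772156649, as given in the description."""
    return math.log(i) + EULER_GAMMA


def w(n):
    """w(n) = 2 H(n-1) - 2 (n-1) / n for a data set of n >= 2 samples."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_len, n):
    """Assumed score s = 2 ** (-E[h(x)] / w(n)), which lies in (0, 1]."""
    return 2.0 ** (-avg_path_len / w(n))


def score_threshold(h_a, n):
    """Score threshold corresponding to the path length threshold h_a."""
    return anomaly_score(h_a, n)


n = 256
h_a = 4.0
delta = score_threshold(h_a, n)
# A short average path length maps to a score above delta, a long one below it.
short_path_is_abnormal = anomaly_score(2.0, n) > delta
long_path_is_abnormal = anomaly_score(9.0, n) > delta
```

Because the assumed score decreases strictly as the average path length grows, comparing E[h(x)] with h_a and comparing the score with the corresponding threshold lead to the same decision.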

S106: Establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data.

After obtaining the normal data and the abnormal data obtained by remotely shunting the data of each server system, a first multivariate Gaussian distribution model is established based on the normal data, and a second multivariate Gaussian distribution model is established based on the abnormal data.

In the process of establishing the first multivariate Gaussian distribution model, the mean $\mu_1$ and the covariance $\Sigma_1$ of the $N_1$ normal data are calculated, from which the first multivariate Gaussian distribution model $p_1(x)$ of the normal data is obtained.

In the process of establishing the second multivariate Gaussian distribution model, the mean $\mu_2$ and the covariance $\Sigma_2$ of the $N_2$ abnormal data are calculated, from which the second multivariate Gaussian distribution model $p_2(x)$ of the abnormal data is obtained.

Thus, the first multivariate Gaussian distribution model established based on each normal data and the second multivariate Gaussian distribution model established based on each abnormal data are obtained.
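As a non-limiting illustration, fitting the two models and evaluating them on a data item can be sketched with the standard library only. The function names and sample values are hypothetical. The mean is the sample mean; the covariance is assumed to be normalized by N (not N - 1); and the density is the usual multivariate Gaussian density, evaluated through a Cholesky factorization to avoid an explicit matrix inverse.

```python
import math


def fit_gaussian(rows):
    """Return (mu, cov) for rows of equal length; covariance divides by N (assumed)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a):
    """Lower-triangular L with L L^T = a; a must be positive definite."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("covariance is not positive definite")
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def density(x, mu, cov):
    """Multivariate Gaussian density p(x) with mean mu and covariance cov."""
    d = len(mu)
    low = cholesky(cov)
    z = []  # forward substitution: solve L z = x - mu
    for i in range(d):
        z.append((x[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    log_p = -0.5 * (d * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))
    return math.exp(log_p)


# Hypothetical two-feature samples already divided by the remote end.
normal = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
abnormal = [(9.0, 9.0), (9.0, 11.0), (11.0, 9.0), (11.0, 11.0)]
mu1, cov1 = fit_gaussian(normal)    # first model p1
mu2, cov2 = fit_gaussian(abnormal)  # second model p2

x = (10.0, 10.0)
p1_x = density(x, mu1, cov1)  # normal probability of x
p2_x = density(x, mu2, cov2)  # abnormal probability of x
```

The two values `p1_x` and `p2_x` are the quantities that would then be compared with the preset normal and abnormal probability thresholds.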

S107: Combine the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model to perform superimposed anomaly detection on each server system data.

After establishing the first multivariate Gaussian distribution model based on each normal data and establishing the second multivariate Gaussian distribution model based on each abnormal data, the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model are combined to analyze each server system data. Perform overlay anomaly detection.

In a specific implementation of the present application, after step S107, the method may also include the following step: when abnormal data exists in the server system data, removing the abnormal data from the temporary storage module. Following the above example, when the data point $x^{*}$ to be detected is abnormal, it is not slid along in the temporary storage module but is eliminated directly. This achieves the separation of normal data and abnormal data.

It can be seen from the above technical solution that, at the near end, feature extraction is performed on the received server system data, binary trees are constructed based on the extracted feature data, the average path length corresponding to each server system data in the binary tree group composed of the binary trees is calculated, and initial anomaly detection is performed on each server system data based on each average path length. When the remote end receives the server system data, it pre-divides the data into normal data and abnormal data. When the initial anomaly detection at the near end finds abnormal data, the normal data and the abnormal data divided by the remote end are obtained, and a multivariate Gaussian distribution model is established for each, so that superimposed anomaly detection is performed on the server system data at the remote end. Near-end anomaly detection has the characteristics of edge computing: it omits the data transmission process and responds faster. When the near end detects an abnormality in the server system data, it can protect the system components promptly, before or just as they begin to heat up, preventing high-temperature damage to the components while maintaining the optimal working state and efficient output of the system. The remote end uses the multivariate Gaussian distribution models to perform global anomaly detection; triggered by the near-end anomaly detection, it performs superimposed anomaly detection to predict major risks such as server standby and crashes, so that maintenance measures can be taken in advance. Through dual-end collaborative anomaly detection, computing resources can be allocated rationally, which prevents an explosion in the amount of computation, improves detection efficiency, and avoids the drawbacks of computation-heavy approaches such as distance-based anomaly detection.

It should be noted that, based on the above embodiments, the embodiments of the present application also provide corresponding improvement solutions. In the subsequent embodiments, the same steps or corresponding steps as in the above embodiments may be referred to each other, and the corresponding beneficial effects may also be referred to each other, which will not be described again in the following improved embodiments.

Referring to Figure 2, Figure 2 is another implementation flow chart of the server anomaly detection method in the embodiment of the present application. The method may include the following steps:

S201: Receive system data from each server.

S202: Perform feature extraction on each server system data to obtain each feature data.

S203: Construct a binary tree based on each feature data to obtain each binary tree.

In a specific implementation of the present application, constructing a binary tree based on each feature data may include the following steps:

The baseboard management controller contains multiple distributed computing structural units, and the number of binary trees to be constructed is preset. When the binary trees are built, the distributed computing structural units in the baseboard management controller construct the preset number of binary trees in parallel based on the feature data. Constructing the binary trees in parallel on these units greatly improves the efficiency of binary tree construction.

An attention mechanism is added to the construction of the binary tree so that only the segmentation of the point $x^{*}$ to be detected is considered. The binary tree therefore does not need to segment all data points, and construction can stop early to improve efficiency.
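The early-stopping idea can be sketched as follows: at every split only the side that contains the point to be detected is kept, so the sibling branch is never built. This is a minimal sketch in the style of an isolation tree, not the application's exact procedure. The function names, the height limit `ceil(log2(sample size))`, the default tree count and sample size, and the choice to add the target to every random sample are all assumptions made for illustration.

```python
import math
import random


def path_length_for_target(sample, target, max_depth, rng):
    """Depth at which `target` is isolated, following only its own branch."""
    subset = list(sample)
    depth = 0
    while depth < max_depth and len(subset) > 1:
        # Only features that still vary inside the subset can be split on.
        splittable = []
        for j in range(len(target)):
            column = [row[j] for row in subset]
            if min(column) < max(column):
                splittable.append((j, min(column), max(column)))
        if not splittable:
            break
        feature, low, high = rng.choice(splittable)
        cut = rng.uniform(low, high)
        # Values below the cut point go left, the rest go right. Only the
        # side holding the target is kept; the other side is discarded.
        if target[feature] < cut:
            subset = [row for row in subset if row[feature] < cut]
        else:
            subset = [row for row in subset if row[feature] >= cut]
        depth += 1
    return depth


def average_path_length(data, target, n_trees=100, sample_size=64, seed=0):
    """Mean isolation depth of `target` over `n_trees` random trees."""
    rng = random.Random(seed)
    pool = [row for row in data if row != target]
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(pool, min(sample_size, len(pool))) + [target]
        max_depth = math.ceil(math.log2(len(sample)))
        total += path_length_for_target(sample, target, max_depth, rng)
    return total / n_trees


grid = [(float(i % 10), float(i // 10)) for i in range(100)]
outlier_depth = average_path_length(grid, (100.0, 100.0))
inlier_depth = average_path_length(grid, (5.0, 5.0))
```

Because each tree depends only on its own random sample, the loop over trees is the part that could be spread across parallel workers. The sketch runs it sequentially for simplicity.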

S204: Calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree.

S205: When abnormal data is detected in each server system data according to each average path length, obtain the first abnormality detection result.

When abnormal data is detected in each server system data according to each average path length, the first abnormality detection result is obtained. The first abnormality detection result may include the specific component in which the abnormality occurs.
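The first-stage decision can be illustrated with the two criteria that appear elsewhere in this description: an average path length at or below a threshold h_a, or an anomaly score above a threshold δ. The mapping from path length to score used below is the standard isolation-forest normalization, swapped in here as an assumption; the application's own preset mapping, which also involves the binary tree height, may differ. All function and parameter names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful search among n points."""
    if n < 2:
        raise ValueError("need at least two samples")
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-avg_path_length / c_factor(n))


def first_stage_abnormal(avg_path_length, n, h_a=None, delta=None):
    """True if any supplied criterion fires (path length or score)."""
    if h_a is not None and avg_path_length <= h_a:
        return True
    if delta is not None and anomaly_score(avg_path_length, n) > delta:
        return True
    return False
```

Either criterion can be used on its own. The thresholds `h_a` and `delta` are tuning parameters that would have to be chosen for the monitored servers.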

S206: Feed back the first abnormality detection result to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system component.

After the first abnormality detection result is obtained, it is fed back to the baseboard management controller. On receiving the first abnormality detection result, the baseboard management controller can parse out which system component is abnormal and then control the fan to cool down the corresponding system component. In this way, when the near end detects (or predicts) an abnormality in the server system data, the system components can be protected at the onset of heating (or before the temperature rises), preventing high temperatures from damaging the components; even if damage has occurred, the system can still be kept in its best working condition with efficient output.

S207: Obtain the normal data and abnormal data obtained by the remote end from distributing the data of each server system.

S208: Establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data.

S209: Use the first multivariate Gaussian distribution model to calculate the normal probability corresponding to each server system data, and use the second multivariate Gaussian distribution model to calculate the abnormal probability corresponding to each server system data.

After the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model are established, the first multivariate Gaussian distribution model is used to calculate the normal probability corresponding to each server system data, and the second multivariate Gaussian distribution model is used to calculate the abnormal probability corresponding to each server system data.
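Fitting the two models and evaluating their densities can be sketched in plain Python as below. This is a minimal sketch assuming maximum-likelihood estimates of the mean and covariance and a positive definite covariance matrix; the helper names and the toy data are hypothetical, and the density value is used as the "probability" referred to in the text.

```python
import math


def fit_gaussian(samples):
    """Return (mean, covariance) estimated from a list of feature vectors."""
    m, d = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / m for j in range(d)]
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in samples) / m
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(matrix):
    """Lower-triangular factor of a positive definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def gaussian_density(x, mean, cov):
    """Multivariate normal density at x."""
    d = len(x)
    lower = cholesky(cov)
    diff = [x[i] - mean[i] for i in range(d)]
    y = []  # forward substitution: solve lower * y = diff
    for i in range(d):
        s = sum(lower[i][k] * y[k] for k in range(i))
        y.append((diff[i] - s) / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(d))
    log_p = -0.5 * (sum(v * v for v in y) + log_det + d * math.log(2.0 * math.pi))
    return math.exp(log_p)


normal_data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]   # toy values
abnormal_data = [(8.0, 8.0), (10.0, 8.0), (8.0, 10.0), (10.0, 10.0)]
model_normal = fit_gaussian(normal_data)
model_abnormal = fit_gaussian(abnormal_data)
p1 = gaussian_density((1.0, 1.0), *model_normal)     # normal probability
p2 = gaussian_density((1.0, 1.0), *model_abnormal)   # abnormal probability
```

The Cholesky factor is used instead of an explicit inverse and determinant because it yields both the quadratic form and the log-determinant in one pass. A singular covariance matrix would make the factorization fail and would need regularization, which this sketch does not attempt.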

S210: Obtain the preset normal probability threshold and abnormal probability threshold, and for each server system data, perform superimposed abnormality detection based on the normal probability threshold, abnormal probability threshold and the normal probability and abnormal probability corresponding to the server system data.

The normal probability threshold and the abnormal probability threshold are set in advance. The preset thresholds are obtained, and for each server system data, superimposed anomaly detection is performed by combining the normal probability threshold, the abnormal probability threshold, and the normal probability and abnormal probability corresponding to that server system data.

Following step S106, the thresholds $\epsilon_n$ and $\epsilon_a$ can be set. For the server system data $x^{(*)}$ to be detected, if and only if $p_1(x^{(*)}) < \epsilon_n$ and $p_2(x^{(*)}) < \epsilon_a$, the model determines that an abnormality has occurred (or is about to occur) in the server, feeds back to the baseboard management controller to seal the disk, and sends a report to the superior, so that operators can plan their work reasonably and the integrity of the work is ensured.

S211: Obtain the second anomaly detection result obtained by superimposed anomaly detection.

After superimposed abnormality detection is performed by combining the normal probability threshold, the abnormal probability threshold, and the normal probability and abnormal probability corresponding to the server system data, a second abnormality detection result obtained by superimposed abnormality detection is obtained. That is, by comparing the normal probability corresponding to the server system data with the normal probability threshold, and comparing the abnormal probability corresponding to the server system data with the abnormal probability threshold, the second abnormality detection result is obtained through the two comparison results.

S212: Perform server abnormality maintenance operations based on the first abnormality detection result and the second abnormality detection result.

After obtaining the first anomaly detection result and the second anomaly detection result, a server anomaly maintenance operation is performed based on the first anomaly detection result and the second anomaly detection result.

In a specific implementation manner of the present application, step S212 may include the following steps:

Step 1: When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is server system data whose normal probability is not within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, send a disk sealing instruction to the baseboard management controller, so that the baseboard management controller performs a disk sealing operation and sends an anomaly detection report to the superior;

Step 2: When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is no server system data whose abnormal probability is within the abnormal probability threshold, send a fan control instruction to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system component;

Step 3: When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is server system data whose normal probability is within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, send a fan control instruction to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system component.

For convenience of description, the above three steps can be combined for explanation.

In the following, a normal probability value greater than or equal to $\epsilon_n$ is regarded as lying within the normal probability threshold range, and an abnormal probability value less than $\epsilon_a$ as lying within the abnormal probability threshold range.

When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is server system data whose normal probability is not within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, that is, when $E[h(x^{*})] \le h_a$, $p_1(x^{(*)}) < \epsilon_n$ and $p_2(x^{(*)}) < \epsilon_a$, or when $s(x^{(*)}, n) > \delta$, $p_1(x^{(*)}) < \epsilon_n$ and $p_2(x^{(*)}) < \epsilon_a$, this indicates a serious abnormality in the system component. A disk sealing instruction is sent to the baseboard management controller, which performs the disk sealing operation according to the instruction and sends an anomaly detection report to the superior.

When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is no server system data whose abnormal probability is within the abnormal probability threshold, that is, when $E[h(x^{*})] \le h_a$ and $p_2(x^{(*)}) > \epsilon_a$, or when $s(x^{(*)}, n) > \delta$ and $p_2(x^{(*)}) > \epsilon_a$, this indicates a minor abnormality in the system component. A fan control instruction is sent to the baseboard management controller, which controls the fan to cool down the corresponding system component according to the instruction.

When the first anomaly detection result is that abnormal data exists, and the second anomaly detection result is that there is server system data whose normal probability is within the normal probability threshold (a normal probability value greater than or equal to $\epsilon_n$) and whose abnormal probability is within the abnormal probability threshold (an abnormal probability value less than $\epsilon_a$), that is, when $E[h(x^{*})] \le h_a$, $p_1(x^{(*)}) \ge \epsilon_n$ and $p_2(x^{(*)}) < \epsilon_a$, or when $s(x^{(*)}, n) > \delta$, $p_1(x^{(*)}) \ge \epsilon_n$ and $p_2(x^{(*)}) < \epsilon_a$, this also indicates a minor abnormality in the system component. A fan control instruction is sent to the baseboard management controller, so that the baseboard management controller controls the fan to cool down the corresponding system component.
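The three cases above can be condensed into a small decision function. This is a sketch only: it evaluates a single sample rather than the whole data set, it treats a normal probability of at least eps_n as within the normal range and an abnormal probability below eps_a as within the abnormal range, and the function name and returned action labels are hypothetical. The behaviour when the first stage reports nothing is not specified in the text and is assumed here to be "no action".

```python
def maintenance_action(first_stage_abnormal, p_normal, p_abnormal, eps_n, eps_a):
    """Map the two detection results to a maintenance action label."""
    if not first_stage_abnormal:
        return "none"  # assumption: nothing is done without a first-stage hit
    normal_in_range = p_normal >= eps_n
    abnormal_in_range = p_abnormal < eps_a
    if not normal_in_range and abnormal_in_range:
        # Serious abnormality: seal the disk and report to the superior.
        return "seal_disk_and_report"
    # Remaining cases are minor abnormalities handled by fan cooling:
    # either the abnormal probability is outside its range, or both
    # probabilities are inside their ranges.
    return "fan_cooling"
```

Keeping the two range tests as named booleans makes it easy to check that every combination reachable after a first-stage hit leads to exactly one action.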

If engineering application scenarios are considered, the calculation method of the model can also be reasonably simplified so that the desired effect is achieved at low computational cost. Assuming that the features of the server system data are mutually independent, then:

where the variable denotes any one feature of the server system data. It follows that:

Therefore:

A threshold $\epsilon$ is set, and the server system data $x^{(*)}$ is determined to be abnormal if and only if $p(x^{(*)}) < \epsilon$.
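Under the independence assumption, the simplified model can be written as a product of univariate Gaussians. The following is a sketch of that standard factorized form and not necessarily the exact expressions of the application. The symbols $x_j$ (the $j$-th feature), $\mu_j$, $\sigma_j$, $m$ (number of samples) and $d$ (number of features) are notational assumptions.

```latex
\begin{align}
x_j &\sim \mathcal{N}(\mu_j, \sigma_j^2), \qquad j = 1, \dots, d \\
\mu_j &= \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_j^{(i)} - \mu_j\bigr)^2 \\
p(x) &= \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_j}
        \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)
\end{align}
```

In this form no covariance matrix has to be inverted, which is what makes the calculation cheap. The price is that correlations between features are ignored.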

Corresponding to the above method embodiments, this application also provides a server anomaly detection device. The server anomaly detection device described below and the server anomaly detection method described above can be mutually referenced.

Referring to Figure 3, Figure 3 is a structural block diagram of a server anomaly detection device in an embodiment of the present application. The device may include:

Data receiving module 31 is used to receive data from each server system;

The feature extraction module 32 is used to extract features from each server system data to obtain each feature data;

The binary tree construction module 33 is used to construct a binary tree based on each feature data to obtain each binary tree;

The path length calculation module 34 is used to calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree;

The data acquisition module 35 is used to acquire the normal data and the abnormal data obtained by the remote end shunting the data of each server system when abnormal data is detected in each server system data according to each average path length;

The model building module 36 is used to establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data;

The superimposed anomaly detection module 37 is used to perform superimposed anomaly detection on each server system data by combining the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model.

In a specific implementation of the present application, the superimposed anomaly detection module 37 includes:

The probability calculation submodule is used to calculate the normal probability corresponding to each server system data using the first multivariate Gaussian distribution model, and calculate the abnormal probability corresponding to each server system data using the second multivariate Gaussian distribution model;

The superimposed anomaly detection submodule is used to obtain the preset normal probability threshold and abnormal probability threshold and, for each server system data, to perform superimposed anomaly detection by combining the normal probability threshold, the abnormal probability threshold, and the normal probability and abnormal probability corresponding to that server system data.

In a specific implementation manner of the present application, the device may further include:

A first result obtaining module, configured to obtain a first abnormality detection result when abnormal data is detected in each server system data according to each average path length;

The component cooling module is used to feed back the first abnormality detection result to the baseboard management controller, so that the baseboard management controller controls the fan to perform a cooling operation on the corresponding system component.

A second result obtaining module, configured to obtain the second anomaly detection result produced by the superimposed anomaly detection, after superimposed anomaly detection has been performed for each server system data by combining the normal probability threshold, the abnormal probability threshold, and the normal probability and abnormal probability corresponding to that server system data;

The server abnormality maintenance module is used to perform server abnormality maintenance operations based on the first abnormality detection result and the second abnormality detection result.

In a specific implementation of this application, the server exception maintenance module includes:

The disk sealing and report sending submodule is used to send a disk sealing instruction to the baseboard management controller when the first anomaly detection result is that abnormal data exists and the second anomaly detection result is that there is server system data whose normal probability is not within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, so that the baseboard management controller performs a disk sealing operation and sends an anomaly detection report to the superior;

The first component cooling submodule is used to send a fan control instruction to the baseboard management controller when the first anomaly detection result is that abnormal data exists and the second anomaly detection result is that there is no server system data whose abnormal probability is within the abnormal probability threshold, so that the baseboard management controller controls the fan to cool down the corresponding system component;

The second component cooling submodule is used to send a fan control instruction to the baseboard management controller when the first anomaly detection result is that abnormal data exists and the second anomaly detection result is that there is server system data whose normal probability is within the normal probability threshold and whose abnormal probability is within the abnormal probability threshold, so that the baseboard management controller controls the fan to cool down the corresponding system component.

In a specific implementation of this application, the data acquisition module 35 includes:

The anomaly score calculation submodule is used to calculate the anomaly score of each server system data in the binary tree group based on each average path length;

The data acquisition sub-module is used to obtain the normal data and abnormal data obtained by remote-end shunting of each server system data when abnormal data is detected in each server system data based on each abnormality score.

The data storage module is used to store the data of each server system in a temporary storage module with queue attributes after receiving the data of each server system;

The feature extraction module 32 is specifically a module that obtains each server system data from the temporary storage module and performs feature extraction on each server system data.

The data elimination module is used to remove the abnormal data from the temporary storage module when abnormal data exists in the server system data, after superimposed anomaly detection has been performed on each server system data by combining the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model.

In a specific implementation of the present application, the feature extraction module 32 includes:

The data selection submodule is used to randomly select a preset number of server system data from each server system data;

The feature extraction submodule is used to extract features from the selected server system data.

In a specific implementation of the present application, the path length calculation module 34 includes:

The path length calculation submodule is used, for each server system data in the binary tree group composed of the binary trees, to calculate the distance from the leaf node where the server system data is located in each binary tree to the root node, thereby obtaining the path length of the server system data on each binary tree;

The average calculation submodule is used to average the path lengths on each binary tree to obtain the average path length corresponding to the server system data.

In a specific implementation of the present application, the data acquisition module 35 is specifically a module that acquires the normal data and abnormal data obtained by the remote end shunting the server system data when it is determined that there is an average path length smaller than the preset abnormal path length threshold.

Corresponding to the above method embodiment, refer to Figure 4, which is a schematic diagram of the server anomaly detection device provided by this application. The device may include:

Memory 332 for storing computer programs;

The processor 322 is configured to implement the steps of the server anomaly detection method of the above method embodiment when executing the computer program.

Specifically, please refer to Figure 5, which is a schematic diagram of the specific structure of a server anomaly detection device provided in this embodiment. The server anomaly detection device may vary considerably with configuration or performance, and may include a processor (central processing unit, CPU) 322 (for example, one or more processors) and a memory 332. The memory 332 stores one or more computer applications 342 or data 344. The memory 332 may be short-term storage or persistent storage. The program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device. Furthermore, the processor 322 may be configured to communicate with the memory 332 and to execute, on the server anomaly detection device 301, the series of instruction operations in the memory 332. The server anomaly detection device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.

The steps in the server anomaly detection method described above can be implemented by the structure of the server anomaly detection device.

Corresponding to the above method embodiment, and with reference to Figure 6, the present application also provides a non-volatile readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the following steps can be implemented:

Receive each server system data; perform feature extraction on each server system data to obtain each characteristic data; construct a binary tree based on each characteristic data to obtain each binary tree; calculate the average path length corresponding to each server system data in the binary tree group composed of each binary tree; when abnormal data is detected in each server system data according to each average path length, obtain each normal data and each abnormal data obtained by the remote end's shunting of each server system data; establish a first multivariate Gaussian distribution model based on each normal data, and establish a second multivariate Gaussian distribution model based on each abnormal data; combine the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model to perform superimposed anomaly detection on each server system data.

The non-volatile readable storage medium may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

For an introduction to the non-volatile readable storage medium provided by this application, please refer to the above method embodiments, and this application will not elaborate further here.

Each embodiment in this specification is described in a progressive manner. Each embodiment focuses on its differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other. As for the devices, equipment and non-volatile readable storage media disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple. For relevant details, please refer to the description in the method section.

This article uses specific examples to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the technical solutions and core ideas of the present application. It should be noted that for those of ordinary skill in the art, several improvements and modifications can be made to the present application without departing from the principles of the present application, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims

A server anomaly detection method, characterized by including:

Receive system data from each server;

Perform feature extraction on each server system data to obtain each feature data;

Construct a binary tree according to each of the characteristic data to obtain each binary tree;

Calculate the average path length corresponding to each of the server system data in the binary tree group composed of each of the binary trees;

When abnormal data is detected in each of the server system data according to each of the average path lengths, obtain the normal data and the abnormal data obtained by diverting the data of each of the server systems from the remote end;

Establish a first multivariate Gaussian distribution model based on each of the normal data, and establish a second multivariate Gaussian distribution model based on each of the abnormal data;

The first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model are combined to perform superimposed anomaly detection on each of the server system data.
The server anomaly detection method according to claim 1, characterized in that, combining the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model to perform superimposed abnormality detection on each of the server system data, including:

Using the first multivariate Gaussian distribution model to calculate the normal probability corresponding to each of the server system data, and using the second multivariate Gaussian distribution model to calculate the abnormal probability corresponding to each of the server system data;

Obtain the preset normal probability threshold and abnormal probability threshold, and perform superimposed abnormality detection for each server system data in combination with the normal probability threshold, the abnormal probability threshold and the normal probability and abnormal probability corresponding to the server system data.
The server anomaly detection method according to claim 2, characterized in that when abnormal data is detected in each of the server system data according to each of the average path lengths, it further includes:

Obtain the first abnormality detection result;

The first abnormality detection result is fed back to the baseboard management controller, so that the baseboard management controller controls the fan to perform a cooling operation on the corresponding system component.
The server anomaly detection method according to claim 3, characterized in that, after performing superimposed anomaly detection for each server system data in combination with the normal probability threshold, the abnormal probability threshold and the normal probability and abnormal probability corresponding to the server system data, the method further includes:

Obtain the second anomaly detection result obtained by superimposed anomaly detection;

Perform server abnormality maintenance operations in combination with the first abnormality detection result and the second abnormality detection result.
The server anomaly detection method according to claim 4, characterized in that, combining the first anomaly detection result and the second anomaly detection result to perform server anomaly maintenance operations, including:

When the first anomaly detection result is that there is abnormal data, and the second anomaly detection result is that there is server system data with a normal probability that is not within the normal probability threshold and an abnormality probability that is within the abnormal probability threshold, sending a disk sealing instruction to the baseboard management controller, so that the baseboard management controller performs a disk sealing operation and sends an abnormality detection report to a superior;

When the first abnormality detection result is that abnormal data exists and the second abnormality detection result is that there is no server system data with an abnormality probability within the abnormality probability threshold, sending a fan control instruction to the baseboard management controller, so that the baseboard management controller controls the fan to perform a cooling operation on the corresponding system component;

When the first anomaly detection result is that there is abnormal data, and the second anomaly detection result is that there is server system data with a normal probability within the normal probability threshold and an abnormality probability within the abnormal probability threshold, sending a fan control instruction to the baseboard management controller, so that the baseboard management controller controls the fan to perform a cooling operation on the corresponding system component.
The server anomaly detection method according to claim 1, wherein the binary tree construction based on each of the characteristic data includes:

Each distributed computing structural unit in the baseboard management controller is used to construct a preset number of binary trees in parallel according to each of the characteristic data.
The server anomaly detection method according to claim 6, wherein the binary tree includes a left subtree and a right subtree, and using each distributed computing structural unit in the baseboard management controller to construct a preset number of binary trees in parallel according to each of the characteristic data includes:

Each distributed computing structural unit in the baseboard management controller is used to divide the server system data into the left subtree or the right subtree according to each of the characteristic data, and generate the preset number of binary trees.
The server anomaly detection method according to claim 7, characterized in that using each distributed computing structural unit in the baseboard management controller to divide the server system data into the left subtree or the right subtree according to each of the characteristic data, and generating the preset number of binary trees, includes:

Use each distributed computing structural unit in the baseboard management controller to determine, from each of the characteristic data, the current characteristic data and the cutting point corresponding to the current characteristic data;

Place server system data smaller than the cutting point in the left subtree and server system data greater than or equal to the cutting point in the right subtree, until the preset termination condition is met, thereby generating the preset number of binary trees.
The server anomaly detection method according to any one of claims 1 to 6, characterized in that, when abnormal data is detected in each of the server system data according to each of the average path lengths, obtaining the normal data and abnormal data obtained by the remote end shunting each of the server system data includes:

Calculate the anomaly score of each server system data in the binary tree group according to each of the average path lengths;

When it is detected that abnormal data exists in each of the server system data according to each of the abnormal scores, each normal data and each abnormal data obtained by the remote end shunting each of the server system data are obtained.
The server anomaly detection method according to claim 1, wherein the binary tree group includes a binary tree height, and calculating the anomaly score of each server system data in the binary tree group based on each of the average path lengths includes:

Using a preset mapping relationship to respectively calculate the anomaly score of each server system data corresponding to each of the average path lengths in the binary tree group;

Wherein, the preset mapping relationship is the relationship between the average path length, the binary tree height and the anomaly score.
The server anomaly detection method according to claim 1, characterized in that, after receiving each server system data, it further includes:

Store each server system data in a temporary storage module with queue attributes;

Performing feature extraction on each server system data includes:

Obtain each server system data from the temporary storage module, and perform feature extraction on each server system data.
The server anomaly detection method according to claim 11, wherein storing each server system data in a temporary storage module with queue attributes includes:

If the temporary storage module with the queue attribute is saturated, each server system data is slidably stored in the temporary storage module.
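One way to read "slidably stored" is as a sliding window: once the queue is full, each new record pushes out the oldest one. A minimal sketch under that assumption, using Python's `collections.deque` with a fixed `maxlen`; the capacity and record names are hypothetical.

```python
from collections import deque

def make_buffer(capacity):
    """Create a temporary storage queue that holds at most `capacity` records."""
    # A deque with maxlen discards items from the opposite end when full,
    # so appending on the right drops the oldest record on the left.
    return deque(maxlen=capacity)

buffer = make_buffer(3)
for record in ["r1", "r2", "r3", "r4", "r5"]:
    buffer.append(record)   # after saturation, each append slides the window
```

After the loop the buffer holds only the three most recent records, in arrival order.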
The server anomaly detection method according to claim 11, wherein the feature data at least includes temperature information, voltage information, memory usage, load information and traffic information, and performing feature extraction on each server system data to obtain each feature data includes:

Obtain each server system data from the temporary storage module, and perform feature extraction on each server system data to obtain the temperature information, the voltage information, the memory usage, the load information and the traffic information.
The server anomaly detection method according to claim 10, characterized in that, after performing superposition anomaly detection on each of the server system data in combination with the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model, the method further includes:

When there is abnormal data in each of the server system data, the abnormal data in the temporary storage module is eliminated.
The server anomaly detection method according to claim 1, characterized in that feature extraction of each server system data includes:

Randomly select a preset number of server system data from each server system data;

Feature extraction is performed on the selected server system data.
The server anomaly detection method according to claim 1, characterized in that calculating the average path length corresponding to each server system data in the binary tree group composed of each of the binary trees includes:

In the binary tree group composed of each of the binary trees, for each server system data, calculate the distance from the leaf node where the server system data is located in each binary tree to the root node, to obtain the path length of the server system data on each binary tree;

Average the path lengths on the binary trees to obtain the average path length corresponding to the server system data.
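The two steps above can be sketched as follows. The tree layout (nested dictionaries with `feature`, `cut`, `left`, `right` for internal nodes and `size` for leaves) and the function names are assumptions made for illustration; the path length is counted as the number of edges between the root and the leaf that holds the record.

```python
def path_length(tree, row):
    """Number of edges from the root to the leaf where `row` lands."""
    depth = 0
    node = tree
    while "size" not in node:                 # descend until a leaf is reached
        if row[node["feature"]] < node["cut"]:
            node = node["left"]               # smaller than the cut point
        else:
            node = node["right"]              # greater than or equal to the cut point
        depth += 1
    return depth

def average_path_length(forest, row):
    """Mean of the per-tree path lengths for one record."""
    return sum(path_length(tree, row) for tree in forest) / len(forest)

# Two small hand-built trees (hypothetical cut points).
tree_a = {
    "feature": 0, "cut": 5.0,
    "left": {"size": 1},
    "right": {
        "feature": 1, "cut": 2.0,
        "left": {"size": 1},
        "right": {"size": 2},
    },
}
tree_b = {"size": 4}                           # a tree that never split
forest = [tree_a, tree_b]

avg_deep = average_path_length(forest, [9.0, 1.0])     # paths 2 and 0
avg_shallow = average_path_length(forest, [1.0, 0.0])  # paths 1 and 0
```

A record with a small average path length was separated from the rest after few splits, which is what the threshold comparison in the following claim relies on.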
The server anomaly detection method according to claim 1, characterized in that, when abnormal data is detected in each of the server system data according to each of the average path lengths, acquiring the normal data and the abnormal data obtained by the remote end shunting each of the server system data includes:

When it is determined that there is an average path length smaller than the preset abnormal path length threshold, acquire each of the normal data and each of the abnormal data obtained by the remote end shunting the server system data.
A server anomaly detection device, characterized by including:

A data receiving module, used to receive each server system data;

A feature extraction module, used to extract features from each server system data to obtain each feature data;

A binary tree building module, used to construct a binary tree based on each of the characteristic data to obtain each binary tree;

A path length calculation module, used to calculate the average path length corresponding to each server system data in the binary tree group composed of each of the binary trees;

A data acquisition module, configured to acquire normal data and abnormal data obtained by remote-end shunting of each server system data when abnormal data is detected in each of the server system data according to each of the average path lengths;

A model building module, configured to establish a first multivariate Gaussian distribution model based on each of the normal data, and establish a second multivariate Gaussian distribution model based on each of the abnormal data;

A superposition anomaly detection module, used to perform superposition anomaly detection on each of the server system data in combination with the first multivariate Gaussian distribution model and the second multivariate Gaussian distribution model.
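The model building and superposition detection modules can be illustrated with a small sketch. Several points are assumptions rather than statements of the claimed method: each multivariate Gaussian uses a diagonal covariance (features treated as independent) to keep the example dependency-free, and the decision rule, which flags a record when its density under the second (abnormal) model exceeds its density under the first (normal) model, is only one possible way to combine the two models. All function names and sample values are hypothetical.

```python
import math

def fit_diag_gaussian(rows, eps=1e-6):
    """Estimate per-feature mean and variance (diagonal covariance, assumed)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in rows) / n + eps  # eps avoids zero variance
        for j in range(d)
    ]
    return means, variances

def log_density(model, row):
    """Log of the Gaussian density of `row` under a diagonal-covariance model."""
    means, variances = model
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(row, means, variances)
    )

def is_abnormal(normal_model, abnormal_model, row):
    """Assumed rule: abnormal if the abnormal model explains the record better."""
    return log_density(abnormal_model, row) > log_density(normal_model, row)

# Hypothetical shunted data: [temperature, voltage]
normal_rows = [[40.0, 12.0], [42.0, 12.1], [41.0, 11.9], [39.0, 12.0]]
abnormal_rows = [[90.0, 10.0], [95.0, 9.5], [88.0, 10.2]]

first_model = fit_diag_gaussian(normal_rows)      # built from normal data
second_model = fit_diag_gaussian(abnormal_rows)   # built from abnormal data

hot_record_flag = is_abnormal(first_model, second_model, [92.0, 9.8])
typical_record_flag = is_abnormal(first_model, second_model, [41.0, 12.0])
```

A full-covariance model would also capture correlations between features such as temperature and load; the diagonal form is used here only for brevity.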
A server anomaly detection device, characterized by including:

A memory, used to store a computer program;

A processor, configured to implement the steps of the server anomaly detection method according to any one of claims 1 to 17 when executing the computer program.
A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and when the computer program is executed by a processor, it implements the steps of the server anomaly detection method according to any one of claims 1 to 17.