CN116048912A

CN116048912A - Cloud server configuration anomaly identification method based on weak supervision learning

Info

Publication number: CN116048912A
Application number: CN202211636518.5A
Authority: CN
Inventors: 田秋雨; 唐宏伟; 潘志伟; 王晓虹
Original assignee: Zhongke Nanjing Information High Speed Railway Research Institute
Current assignee: Zhongke Nanjing Information High Speed Railway Research Institute
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-05-02
Anticipated expiration: 2042-12-20
Also published as: CN116048912B

Abstract

The invention relates to the field of data processing, in particular to a cloud server configuration anomaly identification method based on weakly supervised learning, which mainly comprises the following steps: S1, reading basic configuration information of servers, which comprises discrete variables and non-discrete variables, and reading the historical server usage duration; S2, taking the basic configuration information of the server as the feature variables of a CatBoost regression model and the historical server usage duration as its supervision information, to obtain a prediction model of the server usage duration; S3, taking the non-discrete variables in the basic configuration information of the server, together with the expected server usage duration obtained from the prediction model, as the feature variables of an isolation forest model, to obtain an anomaly identification model. The invention uses the server usage duration as a weak supervision signal for the problem of abnormal server configuration combinations, thereby improving the expressive power of the model.

Description

Cloud server configuration anomaly identification method based on weak supervision learning

Technical Field

The invention relates to the field of data processing, in particular to a cloud server configuration anomaly identification method based on weak supervision learning.

Background

A cloud computing platform, also called a cloud platform, is a service built on computing resources that provides computing, networking, and storage capabilities. The computing resources can be divided into hardware resources and software resources: hardware resources include servers, memory, CPUs and the like, and software resources include application software, integrated development environments and the like. A user only needs to send a request over the network to obtain the required resources from the cloud, and all computing tasks are completed in a remote cloud data center. In composition, a cloud computing platform closely resembles the familiar e-commerce platform, with the three major elements of user, provider and commodity. The users of a cloud computing platform are the consumers of computing resources; they mainly comprise scientific researchers (teachers, students and the like), technicians in enterprises (software developers and database administrators) and members of the public with such needs, and they generally have some knowledge of computer software and hardware. The provider of a cloud computing platform is the actual owner of the computing resources, often a large internet company that owns the computing infrastructure. The commodities of a cloud computing platform fall into four major categories, cloud, network, edge and end, of which the cloud server in the cloud category is the main component. Cloud servers are generally classified into general-purpose cloud servers and GPU cloud servers; with the development of artificial intelligence, GPU cloud servers are becoming an indispensable and popular commodity for meeting the growing demand for neural network training.

Recommending suitable commodities to users improves user experience and is the most important goal of a recommendation system. A recommendation system is essentially an information filtering system: using some algorithm, it filters out the items in the data on which the user is unlikely to act, and recommends the items the user needs. Recommendation systems are widely used in daily life, from bundled sales in supermarkets to e-commerce and news websites, and they constantly influence and change the way people live. A traditional recommendation system uses a collaborative filtering algorithm to calculate the similarity between commodities or between users from user behavior, and then makes recommendations. The most common e-commerce recommendation systems on the market today take a multi-way recall architecture as their cornerstone and provide personalized, intelligent recommendations by embedding artificial intelligence techniques such as learning and knowledge graphs. However, a cloud computing platform differs from the e-commerce platforms served by traditional recommendation systems in user behavior, commodity types and other respects, so the recommendation system of an e-commerce platform cannot be copied wholesale. Users of a computing platform typically purchase fewer types of commodities and use them for longer, so recommending different types of commodities as frequently as an e-commerce platform does is not appropriate. In addition, the cloud computing platform has a critical scenario in which some commodities require user-defined configuration; for example, a user purchasing a cloud server has to choose the disk capacity, CPU, memory, GPU and so on. Therefore, when building a recommendation system for a cloud computing platform, it is necessary to combine the platform's real data, focus on the specific application scenarios, and apply artificial intelligence techniques at the application level through scenario-oriented innovation, so as to improve user experience comprehensively and empower the cloud computing platform.

In the above-mentioned scenario of server configuration selection, some users do not know whether their own selection is reasonable or not due to the different professional backgrounds of users. For example, a user selects 32GB of memory when buying a GPU cloud server, but only 30GB of disk capacity. Since many users who select multi-core high-memory GPU servers are to train a machine learning model, if such a configuration is put into use directly, various errors due to insufficient disk space will occur quickly.

The conventional methods of abnormality detection are as follows:

1) Rule-based approach: trigger conditions for various types of abnormal conditions are defined manually according to user selectable configurations. For example, manually define rules for the selected memory capacity and disk space anomalies.

2) Statistics-based approach: a statistical index over some continuous variable is used to measure whether the configuration selected by the current user is abnormal compared with that of most users. For example, a value of a continuous variable is identified as abnormal using the IQR (interquartile range) method or a normal distribution.

However, both methods have limitations: the first involves too many subjective factors, performs unstably and is time-consuming and labor-intensive, while the second relies on a model that is too coarse and is of limited applicability.
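To make the statistics-based baseline concrete, the IQR rule can be sketched as follows. This is only an illustration of the conventional method, not part of the invention; the fence multiplier `k = 1.5` and the sample disk capacities are assumptions chosen for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical disk capacities (GB) chosen by different users.
disk_gb = [30, 40, 50, 50, 60, 80, 100, 100, 2000]
outliers = iqr_outliers(disk_gb)
print(outliers)
```

Because the rule looks at one variable at a time, it cannot tell that a 30GB disk is a poor match for 32GB of memory, which is the kind of coarseness the text attributes to this approach.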

Disclosure of Invention

The invention aims to solve the recommendation problem related to configuration combinations in the scenario where a user selects a server on a cloud computing platform, and provides a scheme for identifying abnormal configuration combinations based on weakly supervised learning.

A cloud server configuration anomaly identification method based on weak supervision learning comprises the following steps:

S1: reading server basic configuration information from historical data, wherein the server basic configuration information comprises discrete variables and non-discrete variables, and reading the historical server usage duration;

S2: taking the server basic configuration information as the feature variables of a CatBoost regression model and the historical server usage duration as the supervision information of the CatBoost regression model, to obtain a prediction model of the server usage duration, wherein the prediction model is used to calculate the expected server usage duration;

S3: taking the non-discrete variables in the server basic configuration information, together with the expected server usage duration obtained from the prediction model of the server usage duration, as the feature variables of an isolation forest model, to obtain an anomaly identification model;

S4: inputting the server basic configuration information in the data to be tested into the prediction model of the server usage duration, and taking the resulting expected server usage duration, together with the non-discrete variables in the server basic configuration information, as the input of the anomaly identification model, thereby obtaining the servers identified as abnormal.
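The data flow of the identification step (the fourth step above) can be sketched as below. The two models are passed in as plain callables so that the sketch stays independent of any library; the dictionary keys, the function names and the toy stand-in models are all hypothetical and serve only to show how the predicted duration is appended to the non-discrete variables.

```python
from typing import Callable, Dict, List, Sequence

Config = Dict[str, object]

# Hypothetical keys for the non-discrete configuration variables.
NON_DISCRETE = ("cpu_cores", "memory_gb", "disk_gb", "bandwidth_mbps")

def identify_anomalies(
    configs: Sequence[Config],
    predict_duration: Callable[[Config], float],
    is_anomalous: Callable[[List[float]], bool],
) -> List[Config]:
    """Predict the expected duration, then pass it with the non-discrete
    variables to the anomaly identification model."""
    flagged = []
    for cfg in configs:
        expected_hours = predict_duration(cfg)             # duration prediction model
        features = [float(cfg[k]) for k in NON_DISCRETE]   # non-discrete variables
        features.append(expected_hours)                    # weak supervision signal
        if is_anomalous(features):                         # anomaly identification model
            flagged.append(cfg)
    return flagged

# Toy stand-ins for the trained models, for demonstration only.
def toy_duration(cfg):
    return 24.0 if cfg["disk_gb"] < cfg["memory_gb"] else 720.0

def toy_detector(features):
    return features[-1] < 168.0

servers = [
    {"system": "Ubuntu", "architecture": "X86", "cpu_cores": 8,
     "memory_gb": 32, "disk_gb": 30, "bandwidth_mbps": 1},
    {"system": "CentOS", "architecture": "ARM", "cpu_cores": 4,
     "memory_gb": 16, "disk_gb": 100, "bandwidth_mbps": 1},
]
flagged = identify_anomalies(servers, toy_duration, toy_detector)
```

The discrete variables reach only the duration predictor, while the anomaly detector sees the non-discrete variables plus the predicted duration, mirroring the split described in the steps.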

Further,

in S1, user group information is also read from the historical data;

in S2, the user group information is also taken as a feature variable of the CatBoost regression model;

in S4, the user group information in the data to be tested is also input into the prediction model of the server usage duration.

Further,

in S3, among the non-discrete variables in the server basic configuration information, variables whose correlation is lower than a correlation threshold are used independently as feature variables of the isolation forest model, and variables whose correlation is higher than the correlation threshold are used as feature variables of the isolation forest model in the form of their ratios to one another.

Further, the ratios are used as feature variables of the isolation forest model after logarithmic transformation.

Further, the correlation threshold is a Pearson correlation coefficient of 0.25.

Further, a condition is added to the anomaly identification step of the anomaly identification model generated by the isolation forest model: a server can be judged abnormal only when its expected server usage duration is lower than a server usage duration threshold.

Further, the server usage duration threshold is 168 hours.

Further, the discrete variables of the server basic configuration information are the operating system and the architecture, and the non-discrete variables are the CPU core number, memory capacity, hard disk capacity and network bandwidth.

Further, the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum depth of decision tree: 6; learning rate: 0.0496; maximum number of leaves: 64.

Further, the hyperparameters of the isolation forest model in S3 include: whether to use Bootstrap: yes; contamination: 0.01; maximum number of features: 1.0; number of decision trees: 1000.

Beneficial effects:

Key point 1: the server usage duration is used as a weak supervision signal. Technical effect: recommending to users the commodities they can use for the longest time can, in a sense, be regarded as a reasonable recommendation, since usage duration tends to be positively correlated with user satisfaction. Users often delete the original server and create a new one after finding that its configuration is unsuitable, so there is an association between a server's usage duration and the rationality of its configuration. However, because some servers that are in use for a long time also have unreasonable configuration combinations, the usage duration can only serve as a coarse-grained, inexact supervision signal in the sense of weakly supervised learning: it quantifies the rationality of the server configuration combination to some extent, but cannot be the sole criterion for judging whether a server configuration is reasonable. Since the user has not yet used the server when choosing the configuration of a new server, the duration for which the user will use the server has to be predicted from the existing configuration information and the user's own information. In its concrete implementation, the weakly supervised learning in this scheme can be split into two steps, supervised machine learning and unsupervised machine learning (described in key point 2 and key point 3).

Key point 2: based on the supervised learning model CatBoost, regression prediction of the server usage duration is performed from the server configuration and user group information. Technical effect: if a user finds during use that the selected cloud server is unsuitable, the user often deletes the existing server and selects a new one, so servers with unreasonable configuration combinations tend to have shorter usage durations than servers with reasonable configuration combinations, which stay in use for a long time. The server usage duration can therefore provide a supervision signal related to the rationality of the configuration combination. The method exploits CatBoost's special handling of discrete variables and takes into account how the server usage duration is distributed differently across discrete variables such as user group and operating system.

Key point 3: based on the unsupervised learning model Isolation Forest, anomaly identification is performed using the ratios between server configurations and the server usage duration predicted by the supervised learning model. Technical effect: using ratios between configurations instead of individual configuration values prevents high-configuration servers from being judged unreasonable merely because they are rare; introducing the expected usage duration into the feature variables lets it act as a weak supervision signal. Traditional unsupervised anomaly detection models calculate density or separation from the distances between data points, and in such calculations every feature variable contributes equally, which is unreasonable when the feature variables differ greatly in meaning or scale. The isolation forest algorithm used in the method does not involve indexes such as distance or density; instead, it isolates the abnormal points in the sample by combining different random decision trees that partition the data. In addition, because the server usage duration is positively correlated with the rationality of the configuration combination, and in order to prevent the model from identifying samples with a very long predicted usage duration as abnormal, the method improves the original isolation forest algorithm: for the model to identify a sample as abnormal, the sample must not only have an anomaly score higher than a certain threshold, as in the original method, but must also have a predicted usage duration smaller than a certain threshold.

Compared with the prior art, the invention has the following advantages. First, compared with rule-based and statistics-based identification, the method uses machine learning techniques, can jointly consider differences across multiple feature variables and the behavior habits of user groups, and solves the problem that traditional anomaly identification models cannot process discrete variables and may therefore lose important information during modeling. Second, the invention uses the server usage duration as a weak supervision signal for the problem of abnormal server configuration combinations, and adds the expected server usage duration obtained by model prediction both to the feature variables of the unsupervised learning and to the screening condition on its prediction results, thereby improving the expressive power of the model. Third, the invention uses the isolation forest algorithm to build the unsupervised learning model. The algorithm does not need to calculate distance- or density-related indexes; it is based on an ensemble (combined model) architecture and has linear time complexity, which can greatly improve speed and reduce system overhead. Each decision tree in the isolation forest algorithm is generated independently, so the algorithm can be deployed on a large-scale distributed system to accelerate computation; it is more scalable than traditional algorithms, better suited to big-data scenarios, and able to meet the needs of continuously growing data volumes.

Drawings

FIG. 1 is a flow chart of a method;

FIG. 2 is a graph of the log-transformed result of the hard disk capacity-memory capacity ratio;

FIG. 3 is a graph of the log conversion result of the hard disk capacity-CPU core number ratio;

FIG. 4 is a graph of log conversion results of memory capacity versus CPU core number ratio;

FIG. 5 is a graph of server usage time versus log of scale variables (hard disk capacity: memory capacity);

FIG. 6 is a graph of server usage time versus log (hard disk capacity: CPU cores).

FIG. 7 is a graph of log conversion results of memory capacity versus CPU core number ratio;

FIG. 8 is a graph of server usage time versus log of scale variables (hard disk capacity: memory capacity);

FIG. 9 is a graph of server usage time versus log (hard disk capacity: CPU cores).

Detailed Description

1. Exploratory data analysis

Exploratory data analysis is an indispensable first step in machine learning modeling. Sound data analysis promotes understanding of the data and guides the design of feature variables and the selection of models. The following work is mainly performed in the exploratory data analysis stage:

1) Screening of data

The data come from 3147 servers of the department of academy of sciences information OneITLAB platform, of which 2904 are no longer in service. The analysis targets the 1170 servers that were created successfully and have a complete time record. Some of the servers are test servers used only to check whether the platform's functions work normally, and they are removed before analysis. The configuration items studied in this method are: CPU core number, memory capacity, network bandwidth, hard disk capacity, GPU number, operating system, architecture, and GPU type.

2) User population variability analysis

The user groups in OneITLab are: students, testbed users, teachers, scientific research personnel, scientific research team leaders, administrators and super administrators, and one user can have multiple identities. Thus, when a user with multiple identities creates a server, the server is counted in the statistics of each of those identities.

User group	CPU core number	Memory capacity	Network bandwidth	Hard disk capacity	Number of servers
Student	6.87	15.80	1.41	95.98	717
Testbed user	4.15	8.95	1.09	55.46	224
Teacher	5.49	13.11	1.46	79.96	1012
Scientific research team leader	7.21	17.19	10.85	130.24	1013
Administrator	5.00	12.98	0.95	51.85	324
Super administrator	7.94	23.21	1.78	60.16	107

Table 1: average server configuration for each user group

The table above shows that the average configuration differs between user groups: scientific research team leaders select higher configurations, while the average configuration selected by administrators is lower.

In addition, entropy and the Gini index are indicators that measure the disorder of the elements in a set, and can therefore be used to measure how diverse users' configuration choices are (a higher value represents more diversity):

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Let the elements $x_1, x_2, \ldots$ of a set $X$ take the values $v_1, \ldots, v_n$ (n values in total). In the above formula, $p_i$ represents the probability of the value $v_i$, that is, $p_i = \Pr(x = v_i)$, and $H(X)$ represents the entropy.
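As an illustration of these two diversity indicators, the sketch below computes the Shannon entropy, $-\sum_i p_i \log p_i$, and the Gini index, $1 - \sum_i p_i^2$, of a list of configuration choices. The base-2 logarithm, the standard Gini definition and the sample choices are assumptions made for the example; the source does not state which base or variant was used.

```python
import math
from collections import Counter

def probabilities(values):
    """Empirical probability of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return [c / total for c in counts.values()]

def entropy(values):
    """Shannon entropy in bits (base-2 logarithm is an assumption)."""
    return -sum(p * math.log2(p) for p in probabilities(values))

def gini_index(values):
    """Gini index: 1 minus the sum of squared probabilities."""
    return 1.0 - sum(p * p for p in probabilities(values))

# Hypothetical CPU core choices of two user groups.
uniform_choice = [4, 4, 4, 4]    # everyone picks the same value
varied_choice = [2, 4, 8, 16]    # every user picks a different value
print(entropy(uniform_choice), entropy(varied_choice))
print(gini_index(uniform_choice), gini_index(varied_choice))
```

A group whose members all pick the same value scores zero on both indicators, and both indicators grow as the choices spread over more values.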

Table 2: Entropy of server configuration choices for each user group

The table above shows that some user groups, such as scientific research team leaders, choose more diverse configurations, while administrators and testbed users choose more uniform configurations.

3) Correlation analysis between configurations

The Pearson correlation coefficient is used in statistics to measure the degree of linear correlation between two sets of data, the variables X and Y. It is the ratio of the covariance of the two variables to the product of their standard deviations, and its value lies between -1 and 1: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; a value of 0 indicates no linear correlation:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $\rho_{X,Y}$ represents the Pearson correlation coefficient between the variables X and Y of the two sets of data.
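The definition of the Pearson correlation coefficient as covariance divided by the product of the standard deviations translates directly into code. The following is a minimal sketch in plain Python; the sample values are hypothetical, and constant inputs (zero variance) are not handled.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations.
    The common 1/n factors cancel, so raw sums are used."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical configurations in which memory grows in step with CPU cores.
cpu_cores = [2, 4, 8, 16]
memory_gb = [4, 8, 16, 32]
print(pearson(cpu_cores, memory_gb))
```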

Table 3: Pearson correlation coefficients between server configurations

There is a positive correlation between configurations as a whole, where the correlation of CPU core number and memory capacity is as high as 0.97, because CPU and memory capacity tend to appear in combination when a user selects a configuration.

Table 4: average server configuration for CPU cores

From the above table, it can be seen that servers with smaller CPU cores are not generally collocated with GPUs, and the number of servers with smaller CPU cores is larger.

2. Server use time length prediction

The server usage duration in this scheme is predicted using a CatBoost regression model. CatBoost is an ensemble learning model. It uses a boosting structure with regression decision trees as base models: starting from a regression decision tree with weak expressive power, it iteratively improves the model by fitting the residuals, and combines many base models to produce the final prediction. CatBoost processes low-cardinality discrete variables with one-hot encoding and high-cardinality discrete variables with target statistics, which makes it more efficient at handling discrete variables than other ensemble models such as random forests and XGBoost.

Discrete variables are variables that take separate categorical values with no meaningful numerical relationship to one another, such as user group, operating system and architecture. Conventional algorithms cannot process such variables, which may cause important information to be lost during modeling. Non-discrete variables are the remaining parameters, such as CPU core number, memory capacity, hard disk capacity and network bandwidth.

In this patent, the server basic configuration information is divided into discrete and non-discrete variables: the two configuration items (or variables) operating system and architecture are discrete variables, while CPU core number, memory capacity, hard disk capacity and network bandwidth are non-discrete variables.

According to the analysis in the first step, the configurations selected by different user groups differ considerably. For example, testbed users choose from a narrow range of configurations and their average configuration is low; if a member of this group selects a higher configuration that would otherwise be reasonable, the server still has a high probability of being abnormal, and its usage duration will be very short. The invention therefore takes the user group information as a feature variable in the model for predicting the server usage duration, which can improve prediction accuracy.

In summary, when the CatBoost regression model is used, the historical server usage duration serves as the supervision signal in this embodiment, and the selected feature variables and hyperparameters are as follows:

1) Feature variable selection:

Variable name	Examples
CPU core number	4 cores, 8 cores, etc.
Memory capacity	16GB, 32GB, etc.
System	Ubuntu or CentOS
Hard disk capacity	30GB, 100GB, etc.
Network bandwidth	1M, 1000M, etc.
User group	Teacher, student, scientific research team leader, etc.
Architecture	X86 or ARM

TABLE 5

2) Model hyperparameter settings:

TABLE 6

(Note: the model hyperparameters can be adjusted according to the actual scenario.)
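As a hedged illustration, the hyperparameter values stated in the disclosure (1000 iterations, symmetric trees, L2 regularization strength 3, depth 6, learning rate 0.0496, at most 64 leaves) could be written as keyword arguments for `catboost.CatBoostRegressor`. The mapping to these argument names is an assumption, as are the column names. The maximum leaf number is not passed separately here, because a symmetric tree of depth 6 already has 2^6 = 64 leaves and, to the best of our understanding, CatBoost's `max_leaves` option is tied to a different growing policy.

```python
# Assumed mapping of the disclosed hyperparameters to CatBoostRegressor arguments.
CATBOOST_PARAMS = {
    "iterations": 1000,               # number of iterations
    "grow_policy": "SymmetricTree",   # symmetric decision tree structure
    "l2_leaf_reg": 3,                 # L2 regularization strength
    "depth": 6,                       # maximum depth of decision tree
    "learning_rate": 0.0496,
}

# Hypothetical column names for the discrete (categorical) feature variables.
CAT_FEATURES = ["user_group", "system", "architecture"]

def leaves_of_symmetric_tree(depth):
    """A symmetric (oblivious) tree of the given depth has 2**depth leaves."""
    return 2 ** depth

# Usage sketch, left as comments so the block runs without CatBoost installed:
# from catboost import CatBoostRegressor
# model = CatBoostRegressor(**CATBOOST_PARAMS)
# model.fit(train_features, train_usage_hours, cat_features=CAT_FEATURES)
```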

3) Model effect evaluation:

TABLE 7

The formulas:

(1) Coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ denotes the i-th actual value, $\hat{y}_i$ denotes the i-th predicted value, and $\bar{y}$ denotes the mean of the actual values.

$R^2$ measures whether the model is better than a constant (mean) predictor on the current data. Its value normally lies between 0 and 1 (it can become negative when the model is worse than the mean predictor): 0 is equivalent to predicting with the mean, and a value approaching 1 means the model is far better than the mean predictor.

(2) Weighted mean absolute percentage error:

$$\mathrm{WMAPE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}$$

WMAPE is a regression evaluation index for non-negative targets. It reflects the ratio of the error to the actual values and ranges from 0 to infinity; the closer to 0, the better the model.
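The two evaluation indexes can be computed as in the sketch below, which uses the standard definitions $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ and $\mathrm{WMAPE} = \sum_i |y_i - \hat{y}_i| / \sum_i |y_i|$. The usage durations are hypothetical numbers, not results from the embodiment.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

def wmape(actual, predicted):
    """Weighted mean absolute percentage error: total absolute error / total actual."""
    abs_err = sum(abs(y - p) for y, p in zip(actual, predicted))
    return abs_err / sum(abs(y) for y in actual)

# Hypothetical usage durations in hours.
actual_hours = [100.0, 200.0, 300.0]
predicted_hours = [110.0, 190.0, 300.0]
print(r_squared(actual_hours, predicted_hours))
print(wmape(actual_hours, predicted_hours))
```

Predicting the mean for every sample gives an R² of exactly 0, which is the baseline the text compares against.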

As shown in FIG. 1, in this process the server basic configuration information and the user group information are selected as feature variables, the historical server usage duration is used as the supervision signal, and a CatBoost regression model is trained to obtain the prediction model of the server usage duration, which is used to calculate the expected server usage duration.

3. Abnormality recognition model

The anomaly identification model is generated using the isolation forest model. Because the isolation forest supports discrete variables poorly, its feature variables are selected from the non-discrete variables.

From the analysis in the first step, it is known that in the server basic configuration information the hard disk capacity, CPU core number and memory capacity are highly correlated, that is, they tend to increase or decrease together. If this configuration information were used directly without any processing, the model's ability to identify abnormal configuration combinations would be greatly compromised.

In this embodiment, a Pearson correlation coefficient of 0.25 is selected as the correlation threshold. When selecting from the server basic configuration information, variables whose inter-variable correlation is lower than the threshold are used independently as feature variables of the isolation forest model, and variables whose inter-variable correlation is higher than the threshold are used as feature variables in the form of their ratios to one another.
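One possible reading of this selection rule is sketched below: every pair of variables whose correlation exceeds the threshold becomes a ratio feature, and a variable that belongs to no such pair is kept as a standalone feature. The function name and the pairwise interpretation are assumptions. Of the correlation values, only the 0.97 between CPU core number and memory capacity comes from the text; the rest are hypothetical placeholders.

```python
from itertools import combinations

def split_features(names, corr, threshold=0.25):
    """Pairs correlated above the threshold become ratio features;
    variables in no such pair stay as standalone features."""
    ratio_pairs = [
        (a, b) for a, b in combinations(names, 2)
        if corr[frozenset((a, b))] > threshold
    ]
    paired = {name for pair in ratio_pairs for name in pair}
    standalone = [n for n in names if n not in paired]
    return standalone, ratio_pairs

names = ["cpu_cores", "memory", "disk", "bandwidth"]
corr = {
    frozenset(("cpu_cores", "memory")): 0.97,      # reported in the text
    frozenset(("cpu_cores", "disk")): 0.50,        # hypothetical
    frozenset(("memory", "disk")): 0.50,           # hypothetical
    frozenset(("cpu_cores", "bandwidth")): 0.10,   # hypothetical
    frozenset(("memory", "bandwidth")): 0.10,      # hypothetical
    frozenset(("disk", "bandwidth")): 0.10,        # hypothetical
}
standalone, ratio_pairs = split_features(names, corr)
print(standalone, ratio_pairs)
```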

Accordingly, when the feature variables of the isolation forest model are selected, network bandwidth, which has low correlation with the other configurations, is used as a feature variable on its own; the ratios among hard disk capacity, CPU core number and memory capacity are used as feature variables; and the expected server usage duration is added as a further feature variable.

In theory, configuration ratios that are too high or too low are both unreasonable, so the three ratios "memory capacity : CPU core number", "hard disk capacity : memory capacity" and "hard disk capacity : CPU core number" are log-transformed so that their distributions approach the normal distribution. This makes it easier to identify ratios that are too high and too low at the same time, and also makes the order of the two configurations within a ratio unimportant (the data distribution is the same either way). The results of the logarithmic transformation are shown in FIGS. 2 to 9. In addition, since the server usage duration is positively related to the rationality of the server configuration, FIGS. 5 and 6 show that the distribution of server usage duration over the log-scale ratio variables log(hard disk capacity : memory capacity) and log(hard disk capacity : CPU core number) is low at both ends and high in the middle.
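The log-transformed ratio features can be sketched as follows. The natural logarithm and the feature names are assumptions, since the source does not state the base. The example also shows why the order inside a ratio does not matter after the transformation: swapping numerator and denominator only flips the sign.

```python
import math

def log_ratio_features(cpu_cores, memory_gb, disk_gb):
    """Log-transformed ratios between the three highly correlated configurations."""
    return {
        "log_mem_per_cpu": math.log(memory_gb / cpu_cores),
        "log_disk_per_mem": math.log(disk_gb / memory_gb),
        "log_disk_per_cpu": math.log(disk_gb / cpu_cores),
    }

# The configuration from the background example: 32GB memory with a 30GB disk.
feats = log_ratio_features(cpu_cores=8, memory_gb=32, disk_gb=30)
swapped = math.log(32 / 30)  # memory : disk instead of disk : memory
print(feats["log_disk_per_mem"], swapped)
```

A disk smaller than the memory gives a negative log ratio, so unusually low and unusually high ratios sit at opposite tails of a roughly symmetric distribution.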

In summary, when the isolation forest model is used, the feature variables and hyperparameters selected in this embodiment are:

1) Feature variable selection:

TABLE 8

2) Model hyperparameter settings:

Hyperparameter name	Hyperparameter value
Whether to use Bootstrap	Yes
Contamination	0.01
Maximum number of features	1.0
Number of decision trees	1000
Maximum expected server usage duration	168 hours (one week)

TABLE 9
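For illustration, the isolation forest hyperparameters in the table could be expressed as keyword arguments for scikit-learn's `IsolationForest`. The choice of scikit-learn is an assumption; the source does not name the implementation. The maximum expected usage duration is not an estimator argument in that library, so it is kept as a separate constant to be applied after scoring.

```python
# Assumed mapping of the table to sklearn.ensemble.IsolationForest arguments.
ISOLATION_FOREST_PARAMS = {
    "bootstrap": True,        # whether to use Bootstrap: yes
    "contamination": 0.01,    # expected share of anomalous samples
    "max_features": 1.0,      # fraction of features drawn for each tree
    "n_estimators": 1000,     # number of decision trees
}

# Threshold on the predicted usage duration, applied outside the estimator.
MAX_EXPECTED_USAGE_HOURS = 168  # one week

# Usage sketch, left as comments so the block runs without scikit-learn:
# from sklearn.ensemble import IsolationForest
# detector = IsolationForest(**ISOLATION_FOREST_PARAMS).fit(feature_matrix)
# labels = detector.predict(feature_matrix)   # -1 marks outliers, 1 marks inliers
```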

3) Model results:

The model's prediction is determined by the anomaly score. Usually a fixed percentage of the highest anomaly scores is taken as the decision criterion; this percentage is the contamination in the model hyperparameters.

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$

- h(x) represents the depth of sample x in a tree, and E[h(x)] represents its average depth over all trees;

- c(n) represents the average path length when a binary tree is constructed from n samples, and is used to normalize E[h(x)];

- the score s(x, n) ranges from 0 to 1, and the closer it is to 1, the more likely the sample is an outlier.
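The anomaly score can be worked through numerically with the sketch below. It takes the score as $s(x, n) = 2^{-E[h(x)]/c(n)}$ and uses the usual closed form for the normalizer from the isolation forest literature, $c(n) = 2H(n-1) - 2(n-1)/n$ with the harmonic number approximated as $H(i) \approx \ln i + 0.5772156649$. These formulas are an assumption about what the embodiment uses, and the sample size of 256 is only an example value.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary tree of n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); shallow samples score close to 1."""
    return 2.0 ** (-mean_depth / average_path_length(n))

n = 256  # example sub-sample size
print(average_path_length(n))
print(anomaly_score(2.0, n), anomaly_score(12.0, n))
```

A sample whose average depth equals c(n) scores exactly 0.5; samples isolated after only a few splits score higher and are the candidates for anomalies.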

Some specially tailored server configuration combinations are rare, but users tend to use them for a long time, so they should not be identified as abnormal. However, unsupervised learning by its nature easily identifies samples with rare feature values as abnormal, so this scheme improves the isolation forest in the step that generates the prediction result: a sample is judged abnormal only if it has both an anomaly score greater than a certain threshold and a predicted server usage duration less than a certain threshold. Here 168 hours, i.e. one week, is used as the server usage duration threshold.

This embodiment achieves the effect by modifying the isolation forest model directly: the improved isolation forest model accepts the maximum expected server usage duration as a parameter, which sets the server usage duration threshold in the step that generates the prediction result. A person skilled in the art can achieve the same effect in other ways without creative effort, provided that a condition is added to the anomaly identification step so that the anomaly identification model judges a sample abnormal only when the expected server usage duration is lower than the server usage duration threshold.
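The combined decision rule can be sketched in a few lines. The function name and the score threshold of 0.6 are hypothetical; in practice the score threshold would follow from the contamination setting, while the 168-hour default is the value given in the text.

```python
def is_config_anomalous(score, expected_hours, score_threshold,
                        max_expected_hours=168.0):
    """Abnormal only if the anomaly score is high AND the predicted usage is short."""
    return score > score_threshold and expected_hours < max_expected_hours

# A rare configuration that is nevertheless expected to be used for a long time.
rare_but_long_lived = is_config_anomalous(0.8, 2000.0, score_threshold=0.6)
# A rare configuration that is expected to be abandoned within a day.
rare_and_short_lived = is_config_anomalous(0.8, 24.0, score_threshold=0.6)
print(rare_but_long_lived, rare_and_short_lived)
```

The second condition is what keeps rare but well-used configurations from being reported.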

The anomaly identification model uses the output of the server usage duration prediction model as a weak supervision signal, so that both the configuration combination and the usage duration are taken into account in the judgment. In a sense, recommending to users the commodities they can use for the longest time can be regarded as a reasonable recommendation, since usage duration tends to be positively correlated with user satisfaction.

Using the data to be tested, the following server configurations are identified as abnormal by the model of this scheme:

Table 10: Server configuration anomaly identification results

Claims

1. The cloud server configuration anomaly identification method based on weak supervision learning is characterized by comprising the following steps of:

2. The cloud server configuration anomaly identification method based on weak supervised learning of claim 1, wherein,

s1, reading user group information from historical data;

3. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein in the step S3, non-discrete variables in basic configuration information of the server, variables with correlation lower than a correlation threshold are independently used as feature variables of an isolated forest model, and variables with correlation higher than the correlation threshold are taken as the ratio of the variables to each other and used as feature variables of the isolated forest model.

4. The cloud server configuration anomaly identification method based on weak supervision learning of claim 3, wherein the proportion is used as a characteristic variable of an isolated forest model after logarithmic transformation.

5. The cloud server configuration anomaly identification method based on weak supervised learning of claim 3, wherein the correlation threshold is pearson correlation coefficient 0.25.

6. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein a condition is added to the anomaly identification step of the anomaly identification model generated by the isolation forest model, such that a server is determined to be abnormal only when the expected server usage duration is lower than a server usage duration threshold.

7. The cloud server configuration anomaly identification method based on weak supervised learning of claim 6, wherein the server use time length threshold is 168 hours.

8. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the discrete variables of the server basic configuration information are the operating system and the architecture, and the non-discrete variables of the server basic configuration information are the CPU core number, memory capacity, hard disk capacity and network bandwidth.

9. The cloud server configuration anomaly identification method based on weak supervised learning of claim 1, wherein the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum depth of decision tree: 6; learning rate: 0.0496; maximum number of leaves: 64.

10. The cloud server configuration anomaly identification method based on weak supervised learning of claim 1, wherein the hyperparameters of the isolation forest model in S3 comprise: whether to use Bootstrap: yes; contamination: 0.01; maximum number of features: 1.0; number of decision trees: 1000.