CN116048912A - Cloud server configuration anomaly identification method based on weak supervision learning - Google Patents

Publication number
CN116048912A
Authority
CN
China
Prior art keywords
server
model
variables
time length
use time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211636518.5A
Other languages
Chinese (zh)
Other versions
CN116048912B (en)
Inventor
田秋雨
唐宏伟
潘志伟
王晓虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Information High Speed Railway Research Institute
Original Assignee
Zhongke Nanjing Information High Speed Railway Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Nanjing Information High Speed Railway Research Institute
Priority to CN202211636518.5A
Publication of CN116048912A
Application granted
Publication of CN116048912B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data processing, and in particular to a cloud server configuration anomaly identification method based on weakly supervised learning, which mainly comprises the following steps: S1: read server basic configuration information, which comprises discrete variables and non-discrete variables, and read the historical server usage duration; S2: use the server basic configuration information as the feature variables of a CatBoost regression model and the historical server usage duration as its supervision information, to obtain a server usage duration prediction model; S3: use the non-discrete variables in the server basic configuration information, together with the expected server usage duration obtained from the prediction model, as the feature variables of an isolation forest model, to obtain an anomaly identification model. The invention uses the server usage duration as a weak supervision signal for the server configuration anomaly problem, thereby improving the expressive power of the model.

Description

Cloud server configuration anomaly identification method based on weak supervision learning
Technical Field
The invention relates to the field of data processing, in particular to a cloud server configuration anomaly identification method based on weak supervision learning.
Background
A cloud computing platform, or cloud platform, is a service built on computing resources that provides computing, networking, and storage capabilities. The computing resources divide into hardware resources (servers, memory, CPUs, and so on) and software resources (application software, integrated development environments, and so on). A user only needs to send a request over the network to obtain resources that meet their requirements, and all computing tasks are completed in a remote cloud data center. In composition, a cloud computing platform closely resembles a familiar e-commerce platform, with three major elements: users, providers, and commodities. The users are consumers of computing resources, mainly researchers (teachers, students, and so on), enterprise technical staff (software developers and database administrators), and members of the public with computing needs; they generally have some knowledge of computer software and hardware. The provider is the actual owner of the computing resources, often a large internet company that owns the computing infrastructure. The commodities fall into four major categories (cloud, network, edge, and terminal), with the cloud server in the cloud category as the main component. Cloud servers are generally classified into general-purpose cloud servers and GPU cloud servers; with the development of artificial intelligence, GPU cloud servers have become an indispensable and popular commodity for meeting the growing demand for neural network training.
Recommending suitable commodities to users improves user experience and is the most important goal of a recommendation system. A recommendation system is essentially an information filtering system: it uses an algorithm to filter out items the user is unlikely to act on and recommends the items the user needs. Recommendation systems are widely used in daily life, from bundled sales in a supermarket to e-commerce and news websites, and they continually influence how people live. A traditional recommendation system applies collaborative filtering to user behavior to compute the similarity between commodities or between users, and then makes recommendations. The most common e-commerce recommendation systems today are built on a multi-channel recall architecture and provide personalized, intelligent recommendations by incorporating artificial-intelligence techniques such as learning-based models and knowledge graphs. However, cloud computing platforms differ from the e-commerce platforms that traditional recommendation systems target in user behavior, commodity types, and other respects, so an e-commerce recommendation system cannot be adopted wholesale. Cloud platform users typically purchase fewer types of commodities and use them for longer, so recommending different types of commodities as frequently as e-commerce platforms do is not appropriate. In addition, cloud computing platforms have a key scenario of their own: some commodities require user-defined configuration. For example, a user purchasing a cloud server must choose the disk capacity, CPU, memory, GPU, and so on.
Therefore, building a recommendation system for a cloud computing platform requires working from the platform's real data, focusing on its specific application scenarios, and applying artificial-intelligence techniques through scenario-driven innovation, so as to improve the user experience across the board and strengthen the platform.
In the server configuration scenario above, users come from different professional backgrounds, and some do not know whether their own selection is reasonable. For example, a user buying a GPU cloud server selects 32 GB of memory but only 30 GB of disk capacity. Because many users who select multi-core, high-memory GPU servers intend to train machine learning models, putting such a configuration directly into use will quickly produce errors caused by insufficient disk space.
The conventional anomaly detection methods are as follows:
1) Rule-based methods: trigger conditions for each type of abnormal condition are defined manually according to the user-selectable configurations. For example, rules are defined manually for abnormal combinations of memory capacity and disk space.
2) Statistics-based methods: a statistical index on a single continuous variable measures whether the configuration selected by the current user is abnormal compared with most users. For example, the value of a continuous variable is flagged as abnormal using the IQR (interquartile range) method or a normal distribution.
However, both methods have limitations: the first relies heavily on subjective judgment, performs unstably, and is time-consuming and labor-intensive; the second is too coarse a model and applies only narrowly.
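To make the statistics-based baseline concrete, the following is a minimal sketch of an IQR check on a single continuous variable. The multiplier 1.5 is the common Tukey convention and is an assumption here, as are the function name `iqr_outliers` and the sample disk capacities; none of them come from the patent.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical hard disk capacities (GB) chosen by nine users.
disk_gb = [40, 50, 50, 60, 60, 80, 100, 100, 2000]
flagged = iqr_outliers(disk_gb)
```

Such a check looks at one variable at a time, so a combination like 32 GB of memory with 30 GB of disk would pass as long as each value is individually common, which illustrates why the model is described as too coarse.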
Disclosure of Invention
The invention aims to solve the recommendation problem of configuration matching when a user selects a server on a cloud computing platform, and provides a configuration anomaly identification scheme based on weakly supervised learning.
A cloud server configuration anomaly identification method based on weakly supervised learning comprises the following steps:
S1: read server basic configuration information from historical data, where the server basic configuration information comprises discrete variables and non-discrete variables, and read the historical server usage duration;
S2: use the server basic configuration information as the feature variables of a CatBoost regression model and the historical server usage duration as its supervision information, to obtain a server usage duration prediction model, which is used to compute the expected server usage duration;
S3: use the non-discrete variables in the server basic configuration information, together with the expected server usage duration obtained from the prediction model, as the feature variables of an isolation forest model, to obtain an anomaly identification model;
S4: input the server basic configuration information in the data under test into the server usage duration prediction model, then feed the resulting expected server usage duration and the non-discrete variables in the server basic configuration information into the anomaly identification model to obtain the servers identified as abnormal.
Further:
in S1, user group information is also read from the historical data;
in S2, the user group information is also used as a feature variable of the CatBoost regression model;
in S4, the user group information in the data under test is also input into the server usage duration prediction model.
Further:
in S3, among the non-discrete variables in the server basic configuration information, variables whose correlation with the others is below a correlation threshold are used independently as feature variables of the isolation forest model, and variables whose mutual correlation is above the threshold are converted into pairwise ratios that are used as feature variables of the isolation forest model.
Further, the ratios are log-transformed before being used as feature variables of the isolation forest model.
Further, the correlation threshold is a Pearson correlation coefficient of 0.25.
Further, the anomaly identification step of the model generated by the isolation forest adds a condition: a server is judged abnormal only when its expected server usage duration is also below a server usage duration threshold.
Further, the server usage duration threshold is 168 hours.
Further, the discrete variables of the server basic configuration information are the system and the architecture, and the non-discrete variables are the CPU core number, memory capacity, hard disk capacity, and network bandwidth.
Further, the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum decision tree depth: 6; learning rate: 0.0496; maximum number of leaves: 64.
Further, the hyperparameters of the isolation forest model in S3 include: whether to use Bootstrap: yes; contamination: 0.01; maximum number of features: 1.0; number of decision trees: 1000.
The beneficial effects are as follows:
Key point 1: the server usage duration is used as a weak supervision signal. Technical effect: recommending commodities that users can use for the longest time can be considered a reasonable recommendation in a sense, because usage duration is often positively correlated with user satisfaction. Users often delete the original server and create a new one after finding a configuration unsuitable, so the usage duration of a server is associated with the rationality of its configuration. However, some servers that have been in use for a long time still have unreasonable configurations, so the usage duration can serve only as a coarse-grained, inexact supervision signal in weakly supervised learning: it quantifies the rationality of a server configuration to some extent but cannot be the sole criterion for judging it. Because the user has not yet used the server when choosing the configuration of a new server, the usage duration must be predicted from the existing configuration information and the user's own information. In practice, the weakly supervised learning in this scheme splits into two steps, supervised machine learning and unsupervised machine learning (described in key points 2 and 3).
Key point 2: the supervised learning model CatBoost performs regression prediction of the server usage duration from the server configuration and user group information. Technical effect: if users find a selected cloud server unsuitable during use, they often delete it and select a new one, so servers with unreasonable configurations are often used for a shorter time than servers with reasonable configurations. Thus the usage duration provides a supervision signal related to configuration rationality. The method exploits CatBoost's dedicated handling of discrete variables to account for how usage duration is distributed differently across user groups, operating systems, and other discrete variables.
Key point 3: the unsupervised learning model Isolation Forest performs anomaly identification from the proportional relationships among server configurations and the server usage duration predicted by the supervised model. Technical effect: using ratios between configurations instead of individual configuration values avoids judging high-configuration servers unreasonable merely because they are rare; adding the expected usage duration to the feature variables lets it act as the weak supervision signal. Traditional unsupervised anomaly detection models compute density or degree of separation from the distance between data points, and every feature variable contributes equally to that computation; when the feature variables differ greatly in meaning or scale, such a computation is unreasonable. The isolation forest algorithm used here involves no distance or density measures; it isolates abnormal points by combining random decision trees that partition the data. In addition, because server usage duration is positively correlated with configuration rationality, and to keep the model from flagging samples with very high predicted usage duration as abnormal, the method modifies the original isolation forest procedure: a sample is identified as abnormal only if its anomaly score is above a threshold, as in the original method, and its predicted usage duration is also below a threshold.
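The combined condition described in key point 3 can be sketched as a small predicate. This is an illustrative reading, not the patent's implementation: the function name `is_abnormal` and the `score_threshold` argument are hypothetical, and the default of 168 hours is the usage duration threshold stated elsewhere in this document.

```python
def is_abnormal(anomaly_score, expected_hours,
                score_threshold, hours_threshold=168.0):
    """Flag a server only when BOTH conditions hold.

    anomaly_score   : isolation forest score, higher means more isolated
    expected_hours  : usage duration predicted by the regression model
    score_threshold : hypothetical cut-off on the anomaly score
    hours_threshold : usage duration threshold (168 hours = one week)
    """
    return anomaly_score > score_threshold and expected_hours < hours_threshold


# A highly isolated sample that is nevertheless expected to be used for a
# long time is not reported, which is the purpose of the extra condition.
short_lived = is_abnormal(0.7, 24.0, score_threshold=0.6)
long_lived = is_abnormal(0.7, 500.0, score_threshold=0.6)
```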
Compared with the prior art, the invention has the following advantages. First, unlike rule-based and statistics-based identification, the method uses machine learning, can jointly consider differences across multiple feature variables and across user group behavior, and addresses the problem that traditional anomaly identification models cannot handle discrete variables and may therefore lose important information during modeling. Second, the invention uses the server usage duration as a weak supervision signal for the configuration anomaly problem, adding the expected usage duration predicted by the model both to the feature variables of the unsupervised model and to the screening condition on its predictions, which improves the expressive power of the model. Third, the invention builds the unsupervised model with the isolation forest algorithm, which needs no distance or density computation, is based on an ensemble (combined model) architecture, and has linear time complexity, which improves speed and reduces system overhead. Each decision tree in an isolation forest is generated independently, so the model can be deployed on a large-scale distributed system to accelerate computation; it scales better than traditional algorithms, suits big-data scenarios, and accommodates continually growing data volumes.
Drawings
FIG. 1 is a flow chart of the method;
FIG. 2 is a graph of the log-transformed hard disk capacity to memory capacity ratio;
FIG. 3 is a graph of the log-transformed hard disk capacity to CPU core number ratio;
FIG. 4 is a graph of the log-transformed memory capacity to CPU core number ratio;
FIG. 5 is a graph of server usage duration versus the log ratio variable log(hard disk capacity : memory capacity);
FIG. 6 is a graph of server usage duration versus the log ratio variable log(hard disk capacity : CPU core number);
FIG. 7 is a graph of the log-transformed memory capacity to CPU core number ratio;
FIG. 8 is a graph of server usage duration versus the log ratio variable log(hard disk capacity : memory capacity);
FIG. 9 is a graph of server usage duration versus the log ratio variable log(hard disk capacity : CPU core number).
Detailed Description
1. Exploratory data analysis
Exploratory data analysis is an indispensable first step in machine learning modeling. Sound data analysis improves understanding of the data and guides both the design of feature variables and the choice of models. The exploratory data analysis stage mainly covers the following work:
1) Data screening
The data come from 3147 servers on the academy of sciences information OneITLAB platform, of which 2904 servers are no longer in service. The analysis targets 1170 platform servers that were created successfully and have complete time records. Some servers are test servers used only to check that the platform functions normally; these were removed before analysis. The configuration items studied in this method are: CPU core number, memory capacity, network bandwidth, hard disk capacity, GPU number, operating system, architecture, and GPU type.
2) User population variability analysis
The user groups in OneITLab are: students, testbed users, teachers, scientific research personnel, research team leaders, administrators, and super administrators. One user can hold multiple identities, so a server created by such a user is counted in the statistics of each of that user's groups.
User group            CPU core number  Memory capacity  Network bandwidth  Hard disk capacity  Number of servers
Student               6.87             15.80            1.41               95.98               717
Testbed user          4.15             8.95             1.09               55.46               224
Teacher               5.49             13.11            1.46               79.96               1012
Research team leader  7.21             17.19            10.85              130.24              1013
Administrator         5.00             12.98            0.95               51.85               324
Super administrator   7.94             23.21            1.78               60.16               107
Table 1: average server configuration for each user group
The table above shows that the average configuration differs between user groups: research team leaders select higher configurations, while the average configuration selected by administrators is lower.
In addition, entropy and the Gini index measure how disordered the elements of a set are, so they can be used to measure how diverse users' configuration choices are (a higher value indicates more diversity):
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
where the elements $x_1, x_2, \ldots$ of the set $X$ take the values $v_1, \ldots, v_n$ ($n$ values in total), $p_i = \Pr(x = v_i)$ is the probability of the value $v_i$, and $H(X)$ is the entropy.
Figure SMS_2
Table 2: entropy of the server configuration for each user group
The table above shows that some user groups, such as research team leaders, choose from a more diverse range of configurations, while administrators and testbed users choose from a narrower range.
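The entropy and Gini index used to compare user groups can be computed as in the sketch below. The base-2 logarithm and the form `1 - sum(p_i^2)` for the Gini index are standard choices assumed here, since the patent's formula image is not available; the function names and sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(choices):
    """Shannon entropy (base 2) of the observed configuration choices."""
    total = len(choices)
    probs = [count / total for count in Counter(choices).values()]
    return sum(-p * log2(p) for p in probs)


def gini_index(choices):
    """Gini index: 1 minus the sum of squared value probabilities."""
    total = len(choices)
    probs = [count / total for count in Counter(choices).values()]
    return 1.0 - sum(p * p for p in probs)


# Hypothetical CPU core choices for two user groups.
diverse_group = [4, 4, 8, 8]   # two values chosen equally often
uniform_group = [4, 4, 4, 4]   # everyone chose the same value
```

A group whose members all pick the same value scores 0 on both measures, and the scores grow as choices spread over more values.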
3) Correlation analysis between configurations
The Pearson correlation coefficient is used in statistics to measure the degree of linear correlation between two variables X and Y. It is the ratio of the covariance of the two variables to the product of their standard deviations and takes values between -1 and 1: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; a value of 0 indicates no linear correlation:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\rho_{X,Y}$ is the Pearson correlation coefficient between the variables X and Y.
Figure SMS_4
Table 3: person correlation coefficient seen by server configuration
There is a positive correlation between configurations as a whole, where the correlation of CPU core number and memory capacity is as high as 0.97, because CPU and memory capacity tend to appear in combination when a user selects a configuration.
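For reference, the Pearson coefficient defined above can be computed directly from its definition, as in this minimal sketch. The function name `pearson` and the sample values are hypothetical; the data are not taken from the patent's tables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of covariance and variances cancel, so they are omitted.
    return cov / sqrt(var_x * var_y)


# Hypothetical servers where memory is always twice the core count.
cpu_cores = [2, 4, 8, 16]
memory_gb = [4, 8, 16, 32]
rho = pearson(cpu_cores, memory_gb)
```

Pairs of configuration variables whose coefficient exceeds a chosen threshold (0.25 in this document) are the ones later converted into ratios.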
Figure SMS_5
Table 4: average server configuration by CPU core number
The table above shows that servers with fewer CPU cores are generally not paired with GPUs, and that such servers are more numerous.
2. Server usage duration prediction
The server usage duration in this scheme is predicted with a CatBoost regression model. CatBoost is an ensemble learning model built on boosting with regression decision trees as base models: it starts from a weak regression tree, improves the model by iteratively fitting the residuals, and combines the base models to produce the final prediction. CatBoost handles low-cardinality discrete variables with one-hot encoding and high-cardinality discrete variables with target statistics (Target Statistics), which makes it more effective at handling discrete variables than other ensemble models such as random forests and XGBoost.
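To illustrate what a target statistic is, the sketch below encodes a categorical column using only the targets of earlier rows, smoothed toward a prior. It is a simplified illustration of the general ordered target statistic idea and an assumption on my part, not CatBoost's actual implementation (which, among other things, works over random permutations of the data). The function name, the smoothing weight `a`, and the sample data are hypothetical.

```python
def ordered_target_statistics(categories, targets, prior, a=1.0):
    """Encode each row's category from the targets of EARLIER rows only.

    encoded[k] = (sum of earlier targets with the same category + a * prior)
                 / (number of earlier rows with the same category + a)
    """
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))
        # Update the running statistics only after encoding the current row,
        # so a row never sees its own target.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded


# Hypothetical operating systems and usage durations (hours).
systems = ["ubuntu", "centos", "ubuntu", "ubuntu"]
hours = [10.0, 20.0, 30.0, 50.0]
prior = sum(hours) / len(hours)  # global mean, 27.5
encoded = ordered_target_statistics(systems, hours, prior)
```

Because each row is encoded without its own target, the encoding does not leak the value the model is trying to predict.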
Discrete variables are variables whose values are categories with no meaningful numeric relationship to one another, such as user group, system, and architecture. Conventional algorithms cannot handle these variables directly, which can lose important information during modeling. Non-discrete variables are the remaining parameters, such as CPU core number, memory capacity, hard disk capacity, and network bandwidth.
In this patent, the server basic configuration information is divided into discrete and non-discrete variables: the system and the architecture are discrete variables, while the CPU core number, memory capacity, hard disk capacity, and network bandwidth are non-discrete variables.
According to the analysis in the first step, the configurations selected differ considerably between user groups. For example, testbed users choose from a narrow range of configurations with a low average configuration; if a user in this group selects a higher configuration that would be reasonable for other groups, the server is still very likely to be in an abnormal state, and its usage duration will be very short. For this reason, the invention uses the user group information as a feature variable in the server usage duration prediction model, which can improve prediction accuracy.
In summary, when the CatBoost regression model is used in this embodiment, the historical server usage duration serves as the supervision signal, and the selected feature variables and hyperparameters are as follows:
1) Feature variable selection:
Variable name       Examples
CPU core number     4 cores, 8 cores, etc.
Memory capacity     16 GB, 32 GB, etc.
System              Ubuntu or CentOS
Hard disk capacity  30 GB, 100 GB, etc.
Network bandwidth   1M, 1000M, etc.
User group          Teacher, student, research team leader, etc.
Architecture        X86 or ARM
TABLE 5
2) Model hyperparameter settings:
Figure SMS_6
Figure SMS_7
TABLE 6
(Note: the model hyperparameters can be adjusted to the actual scenario.)
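As a configuration sketch, the CatBoost hyperparameter values listed in the summary of the invention can be written as keyword arguments. The mapping of the patent's hyperparameter names onto `catboost.CatBoostRegressor` parameter names is my assumption, as are the column names in `cat_features`; the patent does not show code.

```python
# Keyword arguments intended for catboost.CatBoostRegressor(**catboost_params).
# The name mapping below is an assumption, not taken from the patent.
catboost_params = {
    "iterations": 1000,               # number of iterations
    "grow_policy": "SymmetricTree",   # decision tree structure: symmetric
    "l2_leaf_reg": 3,                 # L2 regularization strength
    "depth": 6,                       # maximum decision tree depth
    "learning_rate": 0.0496,          # learning rate
}

# "Maximum number of leaves: 64" is kept separate: a symmetric tree of
# depth 6 already has 2**6 = 64 leaves, and to my understanding CatBoost's
# max_leaves argument applies only to the Lossguide grow policy.
max_leaves = 64

# Hypothetical column names for the discrete variables, which would be
# passed as cat_features so that CatBoost encodes them itself.
cat_features = ["user_group", "system", "architecture"]
```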
3) Model effect evaluation:
Figure SMS_8
TABLE 7
Formulas:
(1) Coefficient of determination:
$$r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $y_i$ is the i-th actual value, $\hat{y}_i$ is the i-th predicted value, and $\bar{y}$ is the mean of the actual values.
$r^2$ measures whether the model is better than a constant model that always predicts the mean $\bar{y}$ on the current data. Values usually lie between 0 and 1: 0 means the model is equivalent to predicting the mean, and values approaching 1 mean the model is far better than the mean model.
(2) Weighted mean absolute percentage error:
$$\mathrm{WMAPE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}$$
WMAPE is a regression evaluation index for non-negative targets. It reflects the ratio of the total error to the total actual value and ranges from 0 to infinity; the closer to 0, the better the model.
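The two evaluation metrics can be computed as in the following minimal sketch, which uses the standard definitions of the coefficient of determination and WMAPE. The function names and sample values are hypothetical and are not the patent's evaluation data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def wmape(actual, predicted):
    """Weighted MAPE: total absolute error over total absolute actual value."""
    abs_err = sum(abs(y - p) for y, p in zip(actual, predicted))
    return abs_err / sum(abs(y) for y in actual)


# Hypothetical usage durations (hours); the model predicts the mean each time.
actual_hours = [2.0, 4.0]
mean_predictions = [3.0, 3.0]
baseline_r2 = r_squared(actual_hours, mean_predictions)
baseline_wmape = wmape(actual_hours, mean_predictions)
```

Predicting the mean for every sample gives a coefficient of determination of 0, which is the baseline the metric is defined against.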
In the process, as shown in fig. 1, basic configuration information of a server and family group information of a user are selected as characteristic variables, the service time of a history server is used as a supervision signal, and a Catboost regression model is used to obtain a prediction model of the service time of the server, which is used for calculating the service time of an expected server.
3. Abnormality recognition model
The anomaly identification model is generated by using the isolated forest model, and the isolated forest has poor support to discrete variables, so that the characteristic variables of the isolated forest are selected from non-discrete variables.
In the previous analysis in the first step, it can be known that, in the basic configuration information of the server, the memory capacity, the CPU core number and the memory capacity are highly correlated, or are increased or decreased, so that if the configuration information is directly processed without adding any processing, the ability of the model to determine "configuration collocation abnormality" is greatly compromised.
In this embodiment, the pearson correlation coefficient 0.25 is selected as the correlation threshold, and when the server basic configuration information is selected, the variables with the inter-variable correlation lower than the correlation threshold are independently used as the feature variables of the isolated forest model, and the variables with the inter-variable correlation higher than the correlation threshold are taken as the proportion of each other to be used as the feature variables of the isolated forest model.
When the characteristic variable of the isolated forest model is selected, the network bandwidth with low correlation with other configurations is used as the characteristic variable independently, the proportion information among the configurations of the memory capacity, the CPU core number and the memory capacity is used as the characteristic variable, and meanwhile, the expected service life of the server is added as the characteristic variable.
In principle, configuration ratios that are either too high or too low are unreasonable. The ratios "memory capacity : CPU core number", "hard disk capacity : memory capacity" and "hard disk capacity : CPU core number" are therefore log-transformed so that the distributions of the three variables are close to normal. This makes it possible to identify ratios that are too high and too low at the same time, and it also makes the order of the two quantities in a ratio unimportant (the data distribution is the same either way). The results of the logarithmic transformation are shown in Figs. 2 to 9. In addition, because server use duration is positively related to how reasonable the server configuration is, Figs. 5 and 6 show that the distribution of server use duration over the log-scaled variables log(hard disk capacity : memory capacity) and log(hard disk capacity : CPU core number) is low at both ends and high in the middle.
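As a small illustration of why the logarithm makes the order of a ratio unimportant, the sketch below computes the three log-ratio features for one hypothetical server. The variable names and the sample values are assumptions, and the natural logarithm is used because the text does not state the base.

```python
import math


def log_ratio(numerator, denominator):
    """Natural log of a configuration ratio, e.g. memory capacity : CPU core number."""
    return math.log(numerator / denominator)


# Hypothetical server: 8 CPU cores, 32 GB memory, 500 GB hard disk.
cpu_cores, memory_gb, disk_gb = 8, 32, 500

features = {
    "log(memory:cpu)": log_ratio(memory_gb, cpu_cores),
    "log(disk:memory)": log_ratio(disk_gb, memory_gb),
    "log(disk:cpu)": log_ratio(disk_gb, cpu_cores),
}

# Swapping the two quantities only flips the sign of the logarithm, so ratios
# that are too high and too low become mirror images of each other.
swapped = log_ratio(cpu_cores, memory_gb)
```

Because log(a/b) = -log(b/a), a ratio that is 4 times too high and one that is 4 times too low lie at the same distance from zero, so both tails of the distribution can be treated alike.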
In summary, when using the isolation forest model, the feature variables and hyperparameters selected in this embodiment are:
1) Feature variable selection:
Figure SMS_14
Figure SMS_15
TABLE 8
2) Model hyperparameter settings:
Hyperparameter name | Hyperparameter value
Whether to use Bootstrap | Yes
Contamination (pollution degree) | 0.01
Maximum feature number | 1.0
Number of decision trees | 1000
Maximum expected server use duration | 168 hours (one week)
TABLE 9
(Note: the model hyperparameters can be adjusted to suit the actual scenario.)
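For readers who want to try these settings, Table 9 might be written down as follows. The key names follow scikit-learn's `IsolationForest`, which is an assumed choice of library; the patent does not name one. The last entry of Table 9 is not a standard isolation forest parameter, so it is kept as a separate constant with a hypothetical name.

```python
# Table 9 values expressed as keyword arguments. The key names follow
# scikit-learn's IsolationForest (an assumption; the source names no library).
ISOLATION_FOREST_PARAMS = {
    "bootstrap": True,       # whether to use Bootstrap: yes
    "contamination": 0.01,   # "pollution degree" in Table 9
    "max_features": 1.0,     # maximum feature number
    "n_estimators": 1000,    # number of decision trees
}

# Not a standard isolation forest parameter: the extra threshold on the
# expected server use duration used by this embodiment, in hours (one week).
MAX_EXPECTED_USE_HOURS = 168

# Possible use, if scikit-learn is installed:
#   from sklearn.ensemble import IsolationForest
#   model = IsolationForest(**ISOLATION_FOREST_PARAMS)
```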
3) Model results:
The model's prediction is determined by the anomaly score. A fixed percentage of the highest anomaly scores is usually taken as the decision criterion; this percentage is the contamination (pollution degree) among the model hyperparameters.
Figure SMS_16
- h(x) is the depth of sample x in a tree, and E[h(x)] is its average depth over all trees;
- c(n) is the average path length of a binary tree built from n samples, used to normalize E[h(x)];
- the score s(x, n) ranges from 0 to 1, and the closer it is to 1, the more likely the sample is an outlier.
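The formula image itself is not reproduced in this text. The definitions above match the standard isolation forest score, $s(x, n) = 2^{-E[h(x)] / c(n)}$ with $c(n) = 2H(n-1) - 2(n-1)/n$, and the sketch below assumes that this is the formula intended. The function names are hypothetical, and the harmonic number H is approximated with the Euler-Mascheroni constant as is usual for this score.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): average path length used to normalize E[h(x)] for n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)) for n >= 2; values near 1 suggest an outlier."""
    return 2.0 ** (-mean_depth / average_path_length(n))


# A sample isolated after few splits scores higher than one buried deep in the trees.
shallow = anomaly_score(2.0, 256)
deep = anomaly_score(12.0, 256)
```

A sample whose average depth equals c(n) scores exactly 0.5, which is why scores well above 0.5 are read as likely anomalies.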
Some specially customized server configurations are rare, yet users tend to keep them for a long time, so they should not be identified as anomalous. However, because unsupervised learning tends to flag samples with rare feature values as anomalous, this scheme modifies the isolation forest in the step that generates the prediction result: a sample is determined to be abnormal only if its anomaly score is greater than a given threshold and its predicted server use duration is less than a given threshold. Here, 168 hours (one week) is used as the threshold for server use duration.
This embodiment achieves the effect by modifying the isolation forest model directly: the modified model accepts the maximum expected server use duration as a parameter, which sets the server use-duration threshold in the prediction result generation step. Those skilled in the art can achieve the same effect in other ways without creative effort, provided that the anomaly identification step adds the condition that the expected server use duration is below the server use-duration threshold before a sample is determined to be abnormal.
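The two-condition decision rule can be sketched as a small function. The function name, the parameter names and the 0.6 score threshold in the examples are hypothetical; only the 168-hour duration threshold comes from the text.

```python
def is_abnormal(anomaly_score, expected_use_hours,
                score_threshold, max_expected_use_hours=168.0):
    """Flag a configuration only if it is both rare and expected to be short-lived."""
    rare = anomaly_score > score_threshold
    short_lived = expected_use_hours < max_expected_use_hours
    return rare and short_lived


# Hypothetical cases; the 0.6 score threshold is illustrative only.
rare_and_short = is_abnormal(0.72, 30.0, score_threshold=0.6)
rare_but_long = is_abnormal(0.72, 900.0, score_threshold=0.6)
common_config = is_abnormal(0.41, 30.0, score_threshold=0.6)
```

Only the first case is flagged: a rare configuration that is predicted to be used for a long time is treated as a legitimate custom setup rather than an anomaly.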
The anomaly identification model uses the output of the server use-duration prediction model as a weak supervision signal, so that both configuration collocation and use duration are taken into account in the decision. In a sense, recommending to users the products that they are likely to use for the longest time can be regarded as a reasonable recommendation, because use duration is often proportional to user satisfaction.
Using the data to be tested, the following server configurations are identified as abnormal by the model of this scheme:
Figure SMS_17
Table 10: Server configuration anomaly identification results

Claims (10)

1. A cloud server configuration anomaly identification method based on weak supervision learning, characterized by comprising the following steps:
S1: reading server basic configuration information from historical data, the server basic configuration information comprising discrete variables and non-discrete variables, and reading the historical server use duration;
S2: taking the server basic configuration information as feature variables of a CatBoost regression model and the historical server use duration as supervision information of the CatBoost regression model, to obtain a prediction model of server use duration, the prediction model being used to calculate the expected server use duration;
S3: taking the non-discrete variables in the server basic configuration information, together with the expected server use duration obtained from the prediction model of server use duration, as feature variables of an isolation forest model, to obtain an anomaly identification model;
S4: inputting the server basic configuration information in the data to be tested into the prediction model of server use duration, and taking the resulting expected server use duration and the non-discrete variables in the server basic configuration information as inputs of the anomaly identification model, thereby obtaining the servers identified as abnormal.
2. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein:
in S1, user group information is also read from the historical data;
in S2, the user group information is also taken as a feature variable of the CatBoost regression model;
in S4, the user group information in the data to be tested is also input into the prediction model of server use duration.
3. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein in S3, among the non-discrete variables in the server basic configuration information, variables whose correlation is below a correlation threshold are used individually as feature variables of the isolation forest model, and variables whose correlation is above the correlation threshold are converted into ratios of one to another, which are used as feature variables of the isolation forest model.
4. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 3, wherein the ratios are used as feature variables of the isolation forest model after logarithmic transformation.
5. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 3, wherein the correlation threshold is a Pearson correlation coefficient of 0.25.
6. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein the anomaly identification model generated from the isolation forest model adds, in the anomaly identification step, the condition that the expected server use duration is below a server use-duration threshold, and a server is determined to be abnormal only when this condition is also met.
7. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 6, wherein the server use-duration threshold is 168 hours.
8. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the discrete variables of the server basic configuration information are a system and a framework, and the non-discrete variables of the server basic configuration information are a CPU core number, a memory capacity, a hard disk capacity and a network bandwidth.
9. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum depth of decision tree: 6; learning rate: 0.0496; maximum number of leaves: 64.
10. The cloud server configuration anomaly identification method based on weak supervision learning according to claim 1, wherein the hyperparameters of the isolation forest model in S3 include: whether to use Bootstrap: yes; contamination (pollution degree): 0.01; maximum number of features: 1.0; number of decision trees: 1000.
CN202211636518.5A 2022-12-20 2022-12-20 Cloud server configuration anomaly identification method based on weak supervision learning Active CN116048912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211636518.5A CN116048912B (en) 2022-12-20 2022-12-20 Cloud server configuration anomaly identification method based on weak supervision learning

Publications (2)

Publication Number Publication Date
CN116048912A true CN116048912A (en) 2023-05-02
CN116048912B CN116048912B (en) 2024-07-30

Family

ID=86130405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211636518.5A Active CN116048912B (en) 2022-12-20 2022-12-20 Cloud server configuration anomaly identification method based on weak supervision learning

Country Status (1)

Country Link
CN (1) CN116048912B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061620A (en) * 2019-12-27 2020-04-24 福州林科斯拉信息技术有限公司 Intelligent detection method and detection system for server abnormity of mixed strategy
CN112288025A (en) * 2020-11-03 2021-01-29 中国平安财产保险股份有限公司 Abnormal case identification method, device and equipment based on tree structure and storage medium
CN114118162A (en) * 2021-12-01 2022-03-01 盐城工学院 Bearing fault detection method based on improved deep forest algorithm
US20220103444A1 (en) * 2020-09-30 2022-03-31 Mastercard International Incorporated Methods and systems for predicting time of server failure using server logs and time-series data
CN115033591A (en) * 2022-06-01 2022-09-09 广东技术师范大学 Intelligent detection method and system for electricity charge data abnormity, storage medium and computer equipment
CN115359393A (en) * 2022-08-16 2022-11-18 武汉东智科技股份有限公司 Image screen-splash abnormity identification method based on weak supervision learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116760881A (en) * 2023-08-17 2023-09-15 北京智芯微电子科技有限公司 System configuration method and device of power distribution terminal, storage medium and electronic equipment
CN116760881B (en) * 2023-08-17 2023-12-22 北京智芯微电子科技有限公司 System configuration method and device of power distribution terminal, storage medium and electronic equipment
CN117609470A (en) * 2023-12-08 2024-02-27 中科南京信息高铁研究院 Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform
CN117786236A (en) * 2023-12-27 2024-03-29 中科南京信息高铁研究院 Cloud edge collaborative reasoning and personality learning method
CN117786236B (en) * 2023-12-27 2024-08-16 中科南京信息高铁研究院 Cloud edge collaborative reasoning and personality learning method

Also Published As

Publication number Publication date
CN116048912B (en) 2024-07-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant