CN116048912A - Cloud server configuration anomaly identification method based on weak supervision learning - Google Patents
- Publication number
- CN116048912A (application CN202211636518.5A)
- Authority
- CN
- China
- Prior art keywords
- server
- model
- variables
- time length
- use time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the field of data processing, and in particular to a cloud server configuration anomaly identification method based on weakly supervised learning. The method mainly comprises the following steps. S1: read the basic configuration information of servers, which comprises discrete variables and non-discrete variables, and read the historical server usage durations. S2: use the basic configuration information as the feature variables of a CatBoost regression model and the historical server usage duration as its supervision information, obtaining a prediction model of server usage duration. S3: use the non-discrete variables in the basic configuration information, together with the expected server usage duration produced by the prediction model, as the feature variables of an isolation forest model, obtaining an anomaly identification model. The invention uses server usage duration as a weak supervision signal for the problem of abnormal configuration combinations, thereby improving the expressive power of the model.
Description
Technical Field
The invention relates to the field of data processing, in particular to a cloud server configuration anomaly identification method based on weak supervision learning.
Background
Cloud computing platforms, also called cloud platforms, are services built on computing resources that provide computing, networking, and storage capabilities. Computing resources can be divided into hardware resources (servers, memory, CPUs, and so on) and software resources (application software, integrated development environments, and so on). A user only needs to send a request over the network to obtain resources that meet their requirements, and all computing tasks are completed in a remote cloud data center. In composition, a cloud computing platform closely resembles a familiar e-commerce platform, with three main elements: users, providers, and commodities. The users are the consumers of computing resources, mainly researchers (teachers, students, and so on), enterprise technicians (software developers and database administrators), and members of the public with such needs; they generally have some knowledge of computer software and hardware. The provider is the actual owner of the computing resources, often a large internet company that owns the computing infrastructure. The commodities fall into four major categories (cloud, network, edge, and end), with cloud servers in the cloud category as the main component. Cloud servers are generally classified into general-purpose cloud servers and GPU cloud servers. With the development of artificial intelligence, GPU cloud servers are becoming an indispensable and popular commodity because they meet the growing demand for neural network training.
Recommending suitable commodities to users improves user experience and is the most important goal of a recommendation system. A recommendation system is essentially an information filtering system: using some algorithm, it filters out the items a user is unlikely to act on and recommends the items the user needs. Recommendation systems are widely used in daily life, from bundled sales in supermarkets to e-commerce and news websites, and they constantly influence and change how people live. A traditional recommendation system uses a collaborative filtering algorithm to compute the similarity between commodities or between users from user behavior, and then makes recommendations. The most common e-commerce recommendation systems on the market today take a multi-channel recall architecture as their cornerstone and provide personalized, intelligent recommendations by incorporating AI-related techniques such as learning-based methods and knowledge graphs. However, a cloud computing platform differs from the e-commerce platforms that traditional recommendation systems target in user behavior, commodity types, and other respects, so an e-commerce recommendation system cannot be adopted wholesale. Cloud computing platform users typically purchase fewer types of commodities and use them for longer, so recommending different types of commodities as frequently as an e-commerce platform does is not appropriate. In addition, the cloud computing platform has one very important scenario: some commodities require the user to choose a custom combination of configurations. For example, a user who purchases a cloud server must choose the disk capacity, CPU, memory, GPU, and so on.
Therefore, when building a recommendation system for a cloud computing platform, it is necessary to combine the platform's real data, focus on its various application scenarios, and apply AI-related techniques at the application level through scenario-specific innovation, so as to improve user experience across the board and empower the cloud computing platform.
In the server configuration scenario described above, users come from different professional backgrounds, and some do not know whether their own selection is reasonable. For example, a user buying a GPU cloud server selects 32 GB of memory but only 30 GB of disk capacity. Since many users who select multi-core, high-memory GPU servers intend to train machine learning models, putting such a configuration directly into use will quickly produce errors caused by insufficient disk space.
The conventional methods of anomaly detection are as follows:
1) Rule-based methods: trigger conditions for each type of abnormal situation are defined manually according to the configurations the user can select. For example, rules are written by hand for abnormal combinations of memory capacity and disk space.
2) Statistics-based methods: a statistical indicator on some continuous variable is used to measure whether the configuration selected by the current user is abnormal compared with most users. For example, the value of a continuous variable is identified as abnormal using the IQR (interquartile range) method or a normal distribution.
However, both methods have limitations. The first depends heavily on subjective judgment, performs unstably, and is time-consuming and labor-intensive. The second uses a model that is too coarse and has limited applicability.
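As a rough illustration of the statistics-based approach just described, the sketch below flags values outside the usual IQR fences. The multiplier 1.5 and the disk capacities are illustrative assumptions, not values from the patent.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical hard disk capacities in GB (not taken from the patent's data).
disk_gb = [30, 40, 50, 50, 60, 80, 100, 100, 120, 2000]
outliers = iqr_outliers(disk_gb)
print(outliers)
```

A check like this looks at one continuous variable at a time, which is why it cannot tell that 30 GB of disk is only a problem when paired with 32 GB of memory.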
Disclosure of Invention
The invention aims to solve the recommendation problem related to configuration combinations when a user selects a server on a cloud computing platform, and provides a scheme for identifying abnormal configuration combinations based on weakly supervised learning.
A cloud server configuration anomaly identification method based on weakly supervised learning comprises the following steps:
S1: read the server basic configuration information from historical data, where the server basic configuration information comprises discrete variables and non-discrete variables, and read the historical server usage durations;
S2: use the server basic configuration information as the feature variables of a CatBoost regression model and the historical server usage duration as its supervision information, obtaining a prediction model of server usage duration, which is used to compute the expected server usage duration;
S3: use the non-discrete variables in the server basic configuration information, together with the expected server usage duration obtained from the prediction model, as the feature variables of an isolation forest model, obtaining an anomaly identification model;
S4: input the server basic configuration information in the data to be tested into the prediction model of server usage duration, then feed the resulting expected server usage duration and the non-discrete variables in the server basic configuration information into the anomaly identification model, which yields the servers identified as abnormal.
Further:
in S1, user group information is also read from the historical data;
in S2, the user group information is also used as a feature variable of the CatBoost regression model;
in S4, the user group information in the data to be tested is also input into the prediction model of server usage duration.
Further:
in S3, among the non-discrete variables in the server basic configuration information, variables whose correlation with the other variables is below the correlation threshold are used on their own as feature variables of the isolation forest model, and variables whose mutual correlation is above the threshold are converted into pairwise ratios, which are used as feature variables of the isolation forest model.
Further, the ratios are log-transformed before being used as feature variables of the isolation forest model.
Further, the correlation threshold is a Pearson correlation coefficient of 0.25.
Further, a condition is added to the anomaly identification step: a server flagged by the anomaly identification model generated from the isolation forest model is judged abnormal only when its expected server usage duration is also lower than the server usage duration threshold.
Further, the server usage duration threshold is 168 hours.
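The added condition can be written as a simple conjunction. This is a minimal sketch: the function and parameter names are hypothetical, and only the 168-hour threshold comes from the text.

```python
# 168 hours (one week) is the threshold stated in the method.
DURATION_THRESHOLD_HOURS = 168

def is_abnormal(flagged_by_forest, expected_hours,
                threshold_hours=DURATION_THRESHOLD_HOURS):
    """A server is abnormal only if the isolation forest flags it AND its
    expected usage duration is below the threshold."""
    return bool(flagged_by_forest) and expected_hours < threshold_hours

print(is_abnormal(True, 24.0))    # flagged, short expected use
print(is_abnormal(True, 500.0))   # flagged, but expected to be used for long
```

The second call shows the intent of the extra condition: an unusual configuration that is nevertheless predicted to stay in use for a long time is not reported.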
Further, the discrete variables of the server basic configuration information are the operating system and the architecture, and the non-discrete variables are the CPU core number, memory capacity, hard disk capacity, and network bandwidth.
Further, the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum decision tree depth: 6; learning rate: 0.0496; maximum number of leaves: 64.
Further, the hyperparameters of the isolation forest model in S3 include: use Bootstrap: yes; contamination: 0.01; maximum number of features: 1.0; number of decision trees: 1000.
The beneficial effects are as follows:
Key point 2: based on the supervised learning model CatBoost, regression prediction of server usage duration is performed from the server configuration and user group information. Technical effect: if a user finds during use that the selected cloud server is unsuitable, the user often deletes the existing server and selects a new one, so servers with unreasonable configuration combinations are often used for a shorter time than reasonably configured servers, which stay in use for a long time. The server usage duration can therefore provide a supervision signal related to how reasonable a configuration combination is. The method exploits CatBoost's special handling of discrete variables and takes into account how server usage duration is distributed differently across discrete variables such as user group and operating system.
Key point 3: based on the unsupervised learning model Isolation Forest, anomaly identification is performed using the ratios between server configurations and the expected server usage duration predicted by the supervised learning model. Technical effect: using ratios between configurations instead of the individual configuration values avoids judging high-configuration servers unreasonable merely because they are rare, and adding the expected usage duration to the feature variables lets it act as a weak supervision signal. Traditional unsupervised anomaly detection models compute density or degree of separation from the distances between data points, and every feature variable contributes equally to that computation; if the feature variables differ greatly in meaning or scale, the computation is not reasonable. The isolation forest algorithm used in this method does not involve distance or density; instead, it isolates abnormal points in the sample by combining different random decision trees that partition the data. In addition, because server usage duration is positively correlated with how reasonable the configuration combination is, the method modifies the original isolation forest algorithm so that samples with a very long predicted usage duration are not identified as abnormal: a sample is identified as abnormal only when its anomaly score is above a threshold, as in the original method, and its predicted usage duration is also below a threshold.
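To make the isolation idea concrete, here is a deliberately small pure-Python isolation forest. It is a sketch under simplifying assumptions (every tree sees all rows, with no subsampling or bootstrap), not the library implementation the method would use in practice. The data is synthetic and all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup
    among n points; used to normalize tree depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Split rows on a random feature at a random cut point, recursively."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:
        return {"size": len(rows)}
    cut = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < cut]
    right = [r for r in rows if r[feature] >= cut]
    return {"feature": feature, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(row, node, depth=0):
    """Depth at which row lands in a leaf, plus a correction for leaf size."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if row[node["feature"]] < node["cut"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_scores(rows, n_trees=100, seed=0):
    """Score in (0, 1); points isolated after few splits score higher."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(rows)))
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(len(rows))
    return [2.0 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm))
            for r in rows]

# Synthetic two-feature data: a tight cluster plus one far-away point.
data_rng = random.Random(1)
data = [(data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 0.5)) for _ in range(50)]
data.append((10.0, 10.0))
scores = anomaly_scores(data)
print(scores.index(max(scores)))  # index of the most anomalous row
```

No distance or density is computed anywhere: a point is anomalous simply because random cuts separate it from the rest after very few splits. Each tree is built independently, which is what makes the algorithm easy to distribute.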
Compared with the prior art, the invention has the following advantages. First, compared with rule-based and statistics-based identification, the method uses machine learning techniques, can jointly consider differences across multiple feature variables and the behavior habits of user groups, and solves the problem that traditional anomaly identification models cannot handle discrete variables and may therefore lose important information during modeling. Second, the invention uses server usage duration as a weak supervision signal for the problem of abnormal configuration combinations, adding the expected server usage duration obtained from model prediction both to the feature variables of the unsupervised model and to the screening condition applied to its predictions, thereby improving the expressive power of the model. Third, the invention builds the unsupervised learning model with the isolation forest algorithm, which does not need to compute distance-related or density-related indicators. The algorithm is based on an ensemble architecture and has linear time complexity, which greatly improves speed and reduces system overhead. Each decision tree in the isolation forest is generated independently, so the algorithm can be deployed on a large-scale distributed system to accelerate computation; it scales better than traditional algorithms, suits big-data scenarios, and copes with continuously growing data volumes.
Drawings
FIG. 1 is a flow chart of a method;
FIG. 2 is a graph of the log-transformed result of the hard disk capacity-memory capacity ratio;
FIG. 3 is a graph of the log conversion result of the hard disk capacity-CPU core number ratio;
FIG. 4 is a graph of log conversion results of memory capacity versus CPU core number ratio;
FIG. 5 is a graph of server usage time versus log of scale variables (hard disk capacity: memory capacity);
FIG. 6 is a graph of server usage time versus log (hard disk capacity: CPU cores).
FIG. 7 is a graph of log conversion results of memory capacity versus CPU core number ratio;
FIG. 8 is a graph of server usage time versus log of scale variables (hard disk capacity: memory capacity);
FIG. 9 is a graph of server usage time versus log (hard disk capacity: CPU cores).
Detailed Description
1. Exploratory data analysis
Exploratory data analysis is an indispensable first step in machine learning modeling. Sound data analysis improves understanding of the data and guides the design of feature variables and the choice of models. The following work is carried out in the exploratory data analysis stage:
1) Screening of data
The data come from 3,147 servers on the OneITLab platform of the Academy of Sciences information department, 2,904 of which are no longer in service. The analysis targets the 1,170 platform servers that were created successfully and have complete time records. Some servers are test servers used only to check that the platform functions work; these were removed before the analysis. The configurations studied in depth in this method are: CPU core number, memory capacity, network bandwidth, hard disk capacity, GPU number, operating system, architecture, and GPU type.
2) User population variability analysis
The user groups in OneITLab are: students, testbed users, teachers, researchers, research team leaders, administrators, and super administrators. One user can have multiple identities, so when a user with multiple identities creates a server, that server is counted in the statistics of each of the user's groups.
User group | CPU core number | Memory capacity | Network bandwidth | Hard disk capacity | Number of servers |
Student | 6.87 | 15.80 | 1.41 | 95.98 | 717 |
Testbed user | 4.15 | 8.95 | 1.09 | 55.46 | 224 |
Teacher | 5.49 | 13.11 | 1.46 | 79.96 | 1012 |
Research team leader | 7.21 | 17.19 | 10.85 | 130.24 | 1013 |
Administrator | 5.00 | 12.98 | 0.95 | 51.85 | 324 |
Super administrator | 7.94 | 23.21 | 1.78 | 60.16 | 107 |
Table 1: average server configuration for each user group
The table shows that the average configuration differs between user groups: research team leaders choose higher configurations, while the average configuration chosen by administrators is lower.
In addition, entropy and the Gini index are measures of how mixed the elements of a set are, so they can be used to measure how diverse users' configuration choices are (a higher value means more diversity):
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$
Here the elements $x_1, x_2, \ldots$ of the set $X$ take the values $v_1, \ldots, v_n$ ($n$ distinct values in total), $p_i = \Pr(x = v_i)$ is the probability that an element takes the value $v_i$, and $H(X)$ is the entropy.
Table 2: server configuration entropy for each user group
The table shows that some user groups, such as research team leaders, choose more diverse configurations, while administrators and testbed users choose more uniform configurations.
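The two diversity measures can be computed directly from a list of chosen configuration values. This is a minimal sketch: the base-2 logarithm and the Gini form 1 minus the sum of squared probabilities are the standard definitions, assumed here because the source does not show them, and the sample choices are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2, an assumption) of the observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gini_index(values):
    """Gini index, 1 - sum(p_i^2), of the observed values."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

# Hypothetical CPU core choices for two user groups.
uniform_group = [4, 4, 4, 4]     # everyone picks the same configuration
diverse_group = [2, 4, 8, 16]    # every choice is different

print(entropy(uniform_group), entropy(diverse_group))
print(gini_index(uniform_group), gini_index(diverse_group))
```

Both measures are zero when every user makes the same choice and grow as choices spread over more values.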
3) Correlation analysis between configurations
The Pearson correlation coefficient is used in statistics to measure the degree of linear correlation between two variables X and Y. It is the ratio of the covariance of the two variables to the product of their standard deviations and takes values between -1 and 1: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; a value of 0 indicates no linear correlation:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\rho_{X,Y}$ is the Pearson correlation coefficient between the variables X and Y, $\operatorname{cov}(X, Y)$ is their covariance, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.
Table 3: Pearson correlation coefficients between server configurations
Overall the configurations are positively correlated. The correlation between CPU core number and memory capacity is as high as 0.97, because users tend to scale CPU and memory together when selecting a configuration.
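The coefficient follows directly from its definition as covariance over the product of standard deviations. The sketch below uses hypothetical CPU and memory values in which memory is exactly proportional to core count, so the result is 1 rather than the 0.97 measured on the real data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical servers: memory (GB) is always twice the core count.
cpu_cores = [2, 4, 8, 16]
memory_gb = [4, 8, 16, 32]
print(pearson(cpu_cores, memory_gb))
```

The common 1/n factors in the covariance and the standard deviations cancel, so they are omitted.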
Table 4: average server configuration by CPU core number
The table shows that servers with fewer CPU cores are generally not paired with GPUs, and that servers with fewer CPU cores are more numerous.
2. Server use time length prediction
The server usage duration in this scenario is predicted with a CatBoost regression model. CatBoost is an ensemble learning model built on boosting with regression decision trees as base models: it starts from a regression tree with weak predictive power, improves the model by iteratively optimizing against the residuals, and combines the base models to produce the final prediction. CatBoost handles low-cardinality discrete variables with one-hot encoding and high-cardinality discrete variables with target statistics, which makes it more effective at handling discrete variables than other ensemble models such as random forests and XGBoost.
Discrete variables are variables whose values are distinct categories with no meaningful numerical relationship to each other, such as user group, operating system, and architecture. Conventional algorithms cannot handle these variables, which may cause important information to be lost during modeling. Non-discrete variables are the remaining parameters, such as CPU core number, memory capacity, hard disk capacity, and network bandwidth.
In this patent, the server basic configuration information is divided into discrete and non-discrete variables: the two configurations (or variables) operating system and architecture are discrete variables, while CPU core number, memory capacity, hard disk capacity, and network bandwidth are non-discrete variables.
According to the analysis in the first step, the configurations chosen differ greatly between user groups. For example, testbed users make uniform choices and their average configuration is low. If a user in this group selects a higher configuration, even one that is reasonable in itself, the server is still very likely to be abnormal for that group and to be used only briefly. The invention therefore uses user group information as a feature variable in the model that predicts server usage duration, which improves the accuracy of the prediction.
In summary, when using the CatBoost regression model, this embodiment takes the historical server usage duration as the supervision signal, and the selected feature variables and hyperparameters are as follows:
1) Feature variable selection:
Variable name | Examples |
CPU core number | 4 cores, 8 cores, etc. |
Memory capacity | 16GB, 32GB, etc. |
Operating system | Ubuntu or CentOS |
Hard disk capacity | 30GB, 100GB, etc. |
Network bandwidth | 1M, 1000M, etc. |
User group | Teacher, student, research team leader, etc. |
Architecture | X86 or ARM |
TABLE 5
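2) Model hyperparameter settings:
Hyperparameter name | Hyperparameter value |
Number of iterations | 1000 |
Decision tree structure | Symmetric |
L2 regularization strength | 3 |
Maximum decision tree depth | 6 |
Learning rate | 0.0496 |
Maximum number of leaves | 64 |
TABLE 6
(Note: the hyperparameters of the model can be adjusted according to the actual scenario.)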
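For readers who want to reproduce the regression step, the stated CatBoost hyperparameters might map onto the `catboost` Python package's keyword names roughly as below. This mapping is an assumption rather than something the patent specifies, the feature names are hypothetical, and the fitting call is left as a comment because the training data is not available here.

```python
# Assumed mapping of the stated hyperparameters to catboost keyword names.
catboost_params = {
    "iterations": 1000,               # number of iterations
    "grow_policy": "SymmetricTree",   # symmetric decision tree structure
    "l2_leaf_reg": 3,                 # L2 regularization strength
    "depth": 6,                       # maximum decision tree depth
    "learning_rate": 0.0496,
}
# The stated maximum of 64 leaves equals 2 ** depth for a symmetric tree of
# depth 6, so it is not set as a separate option here (assumption).

# Hypothetical column names for the discrete (categorical) features.
CATEGORICAL_FEATURES = ["operating_system", "architecture", "user_group"]

# Usage, assuming catboost is installed and X, y hold the historical data:
#   from catboost import CatBoostRegressor
#   model = CatBoostRegressor(**catboost_params, cat_features=CATEGORICAL_FEATURES)
#   model.fit(X, y)   # y: historical server usage duration
print(catboost_params)
```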
3) Model effect evaluation:
TABLE 7
The formulas are:
(1) Coefficient of determination:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $y_i$ is the i-th actual value, $\hat{y}_i$ is the i-th predicted value, and $\bar{y}$ is the mean of the actual values.
$R^2$ measures whether the model is better than a constant (mean) predictor on the current data. Values lie between 0 and 1: 0 means the model is equivalent to predicting the mean, and a value approaching 1 means it is far better than the mean model.
(2) Weighted mean absolute percentage error:
$$\mathrm{WMAPE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}$$
WMAPE is a regression evaluation metric for non-negative targets. It reflects the ratio of the error to the actual values and ranges from 0 to infinity; the closer to 0, the better the model.
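Both metrics are short to compute by hand. The sketch below uses the standard definitions of the coefficient of determination and WMAPE; the usage durations are hypothetical, and the first prediction is deliberately the mean so that the baseline behavior is visible.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def wmape(actual, predicted):
    """Total absolute error divided by total absolute actual value."""
    abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return abs_error / sum(abs(a) for a in actual)

# Hypothetical usage durations in hours.
actual_hours = [100.0, 200.0, 300.0]
mean_prediction = [200.0, 200.0, 200.0]   # always predict the mean

print(r_squared(actual_hours, mean_prediction))   # mean model baseline
print(wmape(actual_hours, mean_prediction))
```

Predicting the mean everywhere gives a coefficient of determination of 0, which is the baseline the text describes.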
In this process, as shown in FIG. 1, the server basic configuration information and the user group information are selected as feature variables, the historical server usage duration is used as the supervision signal, and a CatBoost regression model is trained to obtain the prediction model of server usage duration, which is used to compute the expected server usage duration.
3. Abnormality recognition model
The anomaly identification model is generated by using the isolated forest model, and the isolated forest has poor support to discrete variables, so that the characteristic variables of the isolated forest are selected from non-discrete variables.
In the previous analysis in the first step, it can be known that, in the basic configuration information of the server, the memory capacity, the CPU core number and the memory capacity are highly correlated, or are increased or decreased, so that if the configuration information is directly processed without adding any processing, the ability of the model to determine "configuration collocation abnormality" is greatly compromised.
In this embodiment, the pearson correlation coefficient 0.25 is selected as the correlation threshold, and when the server basic configuration information is selected, the variables with the inter-variable correlation lower than the correlation threshold are independently used as the feature variables of the isolated forest model, and the variables with the inter-variable correlation higher than the correlation threshold are taken as the proportion of each other to be used as the feature variables of the isolated forest model.
Accordingly, when the feature variables of the isolation forest model are selected, network bandwidth, which has low correlation with the other configuration items, is used on its own as a feature variable; the pairwise ratios among hard disk capacity, CPU core count, and memory capacity are used as feature variables; and the expected server usage duration is added as a further feature variable.
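The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the comparison uses the absolute value of the Pearson coefficient (the text does not say whether the sign matters), the helper names are hypothetical, and the sample columns are made-up values chosen so that bandwidth is uncorrelated with the other three variables.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def split_by_correlation(columns, threshold=0.25):
    """Return (variables used on their own, variable pairs to turn into ratios)."""
    pairs = [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold  # assumption: |r|
    ]
    paired = {name for pair in pairs for name in pair}
    standalone = [name for name in columns if name not in paired]
    return standalone, pairs


# Hypothetical configuration columns (illustrative values only).
columns = {
    "cpu_cores": [1, 2, 3, 4, 5, 6],
    "memory_gb": [2, 4, 6, 8, 10, 12],
    "disk_gb": [60, 110, 130, 180, 260, 310],
    "bandwidth_mbps": [6, 4, 5, 5, 4, 6],
}

standalone, ratio_pairs = split_by_correlation(columns)
print("standalone features:", standalone)
print("pairs converted to ratios:", ratio_pairs)
```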
In theory, a configuration ratio that is either too high or too low is unreasonable. We therefore apply a logarithmic transformation to the three ratios "memory capacity : CPU core count", "hard disk capacity : memory capacity", and "hard disk capacity : CPU core count" so that their distributions are close to normal. This makes it easier to identify ratios that are too high and ratios that are too low at the same time, and it also makes the order of the two quantities in a ratio unimportant: because log(a/b) = -log(b/a), swapping them only mirrors the distribution. The results of the logarithmic transformation are shown in Figs. 2 to 9. In addition, because server usage duration is positively related to how reasonable the server configuration is, Figs. 5 and 6 show that the distribution of server usage duration over the log-scale variables log(hard disk capacity : memory capacity) and log(hard disk capacity : CPU core count) is low at both ends and high in the middle.
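A minimal Python sketch of the log-ratio features follows. It assumes the natural logarithm (the text does not state the base) and uses hypothetical feature names and a made-up configuration; it also checks the sign-flip property that makes the order within a ratio unimportant.

```python
import math


def log_ratio_features(cpu_cores: float, memory_gb: float, disk_gb: float) -> dict:
    """Log-transformed pairwise ratios of the three correlated variables."""
    return {
        "log_mem_per_cpu": math.log(memory_gb / cpu_cores),
        "log_disk_per_mem": math.log(disk_gb / memory_gb),
        "log_disk_per_cpu": math.log(disk_gb / cpu_cores),
    }


# Hypothetical configuration: 4 cores, 8 GB memory, 100 GB disk.
features = log_ratio_features(cpu_cores=4, memory_gb=8, disk_gb=100)
for name, value in features.items():
    print(f"{name}: {value:.4f}")

# Swapping numerator and denominator only flips the sign, so an unusually
# high ratio and an unusually low ratio sit symmetrically around zero.
swapped = math.log(4 / 8)
print("log(mem/cpu) + log(cpu/mem) =", features["log_mem_per_cpu"] + swapped)
```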
In summary, the feature variables and hyperparameters selected in this embodiment for the isolation forest model are:
1) Feature variable selection:
TABLE 8
2) Model hyperparameter settings:
Hyperparameter name | Hyperparameter value |
---|---|
Use Bootstrap | Yes |
Contamination | 0.01 |
Maximum number of features | 1.0 |
Number of decision trees | 1000 |
Maximum expected server usage duration | 168 hours (one week) |
TABLE 9
(Note: the model hyperparameters can be adjusted to suit the actual scenario.)
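For readers who want to reproduce the settings in Table 9, the sketch below writes them as keyword arguments. It assumes scikit-learn's `IsolationForest`, whose parameter names (`bootstrap`, `contamination`, `max_features`, `n_estimators`) correspond to the first four rows; the text does not name the library actually used, and the last row is specific to this scheme rather than a scikit-learn parameter. The constant and function names are hypothetical.

```python
# Keyword arguments named after scikit-learn's IsolationForest parameters
# (assumption: the embodiment does not state which library was used).
ISOLATION_FOREST_PARAMS = {
    "bootstrap": True,       # "Use Bootstrap: Yes"
    "contamination": 0.01,   # share of samples flagged by anomaly score
    "max_features": 1.0,     # use all features for every tree
    "n_estimators": 1000,    # number of decision trees
}

# Not a scikit-learn parameter: the extra threshold introduced by this scheme.
MAX_EXPECTED_USAGE_HOURS = 168  # one week


def expected_flagged(n_samples: int) -> int:
    """Rough number of samples flagged by score alone at this contamination."""
    return round(ISOLATION_FOREST_PARAMS["contamination"] * n_samples)


print(expected_flagged(10_000))
# With scikit-learn installed, the dictionary could be passed as
# IsolationForest(**ISOLATION_FOREST_PARAMS).
```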
3) Model results:
The prediction result of the model is determined by the anomaly score. A fixed percentage of the highest anomaly scores is usually taken as the decision criterion; this percentage is the contamination ("pollution degree") hyperparameter of the model. In the standard isolation forest formulation, the anomaly score is

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$

where:
- h(x) is the depth of sample x in a tree, and E[h(x)] is its average depth over all trees;
- c(n) is the average path length of a binary tree built from n samples, used to normalize E[h(x)];
- the score s(x, n) ranges from 0 to 1, and the closer it is to 1, the more likely the sample is an outlier.
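The quantities defined above can be computed directly. The sketch below follows the standard isolation forest formulation, in which s(x, n) = 2^(-E[h(x)] / c(n)) and c(n) = 2H(n-1) - 2(n-1)/n with the harmonic number H approximated by ln(i) plus the Euler-Mascheroni constant. The function names and the subsample size of 256 are assumptions for illustration; the embodiment does not state them.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n: int) -> float:
    """c(n): average path length of a binary tree built from n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an outlier."""
    return 2.0 ** (-mean_depth / average_path_length(n))


n = 256  # hypothetical subsample size per tree
print("c(n) =", average_path_length(n))
print("isolated early (depth 2):", anomaly_score(2.0, n))
print("average depth c(n):     ", anomaly_score(average_path_length(n), n))
print("isolated late (depth 20):", anomaly_score(20.0, n))
```

A sample that is isolated after only a few splits gets a score well above 0.5, while a sample whose average depth equals c(n) scores exactly 0.5.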
Some specially customized server configuration collocations are rare, yet users tend to use them for a long time, so they should not be identified as anomalous. However, unsupervised learning tends to flag samples with rare feature values as anomalous. This scheme therefore modifies the isolation forest in the step that generates the prediction result: a sample is determined to be anomalous only if its anomaly score is above a certain threshold and its predicted server usage duration is below a certain threshold. Here, 168 hours (one week) is used as the server usage duration threshold.
This embodiment achieves the effect by modifying the isolation forest model directly: in the step that generates the prediction result, the improved isolation forest model takes the maximum expected server usage duration as an additional parameter, which sets the server usage duration threshold. A person skilled in the art could achieve the same effect in other ways without creative effort, provided that the anomaly identification step adds the condition that the expected server usage duration is below the server usage duration threshold before a sample is determined to be anomalous.
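The two-condition decision rule can be written as a small post-processing step. The Python sketch below is one possible reading, not the embodiment's actual code: the class, the function, the score threshold of 0.7, and the three sample servers are all hypothetical, while the 168-hour limit comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ScoredServer:
    name: str
    anomaly_score: float         # isolation forest score s(x, n), between 0 and 1
    expected_usage_hours: float  # output of the usage-duration prediction model


def is_anomalous(
    server: ScoredServer,
    score_threshold: float,
    max_expected_usage_hours: float = 168.0,  # one week, as in the text
) -> bool:
    """Flag a server only when BOTH conditions hold."""
    return (
        server.anomaly_score > score_threshold
        and server.expected_usage_hours < max_expected_usage_hours
    )


# Hypothetical scored samples (illustrative values only).
servers = [
    ScoredServer("rare_but_long_lived", anomaly_score=0.80, expected_usage_hours=2000.0),
    ScoredServer("rare_short_lived", anomaly_score=0.80, expected_usage_hours=24.0),
    ScoredServer("common_short_lived", anomaly_score=0.40, expected_usage_hours=24.0),
]

flagged = [s.name for s in servers if is_anomalous(s, score_threshold=0.7)]
print(flagged)
```

Only the rare configuration with a short predicted usage duration is flagged; the rare configuration that users are expected to keep for a long time is left alone, which is the behavior the scheme aims for.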
The anomaly identification model uses the output of the server usage duration prediction model as a weak supervision signal, so that its decision takes both configuration collocation and usage duration into account. In a sense, recommending to users the products they are likely to use for the longest time can be regarded as a reasonable recommendation, because usage duration is often proportional to user satisfaction.
On the data to be tested, the model of this scheme identifies the following server configurations as anomalous:
Table 10: Server configuration anomaly identification results
Claims (10)
1. A cloud server configuration anomaly identification method based on weak supervision learning, characterized by comprising the following steps:
S1: reading server basic configuration information from historical data, the server basic configuration information comprising discrete variables and non-discrete variables, and reading the historical server usage duration;
S2: taking the server basic configuration information as feature variables of a CatBoost regression model and the historical server usage duration as the supervision information of the CatBoost regression model, to obtain a prediction model of server usage duration, which is used to calculate the expected server usage duration;
S3: taking the non-discrete variables in the server basic configuration information, together with the expected server usage duration obtained from the prediction model of server usage duration, as feature variables of an isolation forest model, to obtain an anomaly identification model;
S4: inputting the server basic configuration information in the data to be tested into the prediction model of server usage duration, and taking the resulting expected server usage duration, together with the non-discrete variables in the server basic configuration information, as input to the anomaly identification model, thereby obtaining the servers identified as anomalous.
2. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein:
S1 further comprises reading user group information from the historical data;
S2 further comprises taking the user group information as a feature variable of the CatBoost regression model;
S4 further comprises inputting the user group information in the data to be tested into the prediction model of server usage duration.
3. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein in step S3, among the non-discrete variables in the server basic configuration information, variables whose correlation is below a correlation threshold are used on their own as feature variables of the isolation forest model, and variables whose correlation is above the correlation threshold are converted into ratios of one to another, the ratios being used as feature variables of the isolation forest model.
4. The cloud server configuration anomaly identification method based on weak supervision learning of claim 3, wherein the ratios are used as feature variables of the isolation forest model after logarithmic transformation.
5. The cloud server configuration anomaly identification method based on weak supervision learning of claim 3, wherein the correlation threshold is a Pearson correlation coefficient of 0.25.
6. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the anomaly identification model generated from the isolation forest model adds, in the anomaly identification step, the condition that the expected server usage duration is below a server usage duration threshold, and a server is determined to be anomalous only when this condition is also met.
7. The cloud server configuration anomaly identification method based on weak supervision learning of claim 6, wherein the server usage duration threshold is 168 hours.
8. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the discrete variables of the server basic configuration information are a system and a framework, and the non-discrete variables of the server basic configuration information are a CPU core number, a memory capacity, a hard disk capacity and a network bandwidth.
9. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the hyperparameters used by the CatBoost regression model in S2 include: number of iterations: 1000; decision tree structure: symmetric; L2 regularization strength: 3; maximum depth of decision tree: 6; learning rate: 0.0496; maximum number of leaves: 64.
10. The cloud server configuration anomaly identification method based on weak supervision learning of claim 1, wherein the hyperparameters of the isolation forest model in S3 include: use Bootstrap: yes; contamination: 0.01; maximum number of features: 1.0; number of decision trees: 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211636518.5A CN116048912B (en) | 2022-12-20 | 2022-12-20 | Cloud server configuration anomaly identification method based on weak supervision learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116048912A true CN116048912A (en) | 2023-05-02 |
CN116048912B CN116048912B (en) | 2024-07-30 |
Family
ID=86130405
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211636518.5A Active CN116048912B (en) | 2022-12-20 | 2022-12-20 | Cloud server configuration anomaly identification method based on weak supervision learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116048912B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061620A (en) * | 2019-12-27 | 2020-04-24 | 福州林科斯拉信息技术有限公司 | Intelligent detection method and detection system for server abnormity of mixed strategy |
US20220103444A1 (en) * | 2020-09-30 | 2022-03-31 | Mastercard International Incorporated | Methods and systems for predicting time of server failure using server logs and time-series data |
CN112288025A (en) * | 2020-11-03 | 2021-01-29 | 中国平安财产保险股份有限公司 | Abnormal case identification method, device and equipment based on tree structure and storage medium |
CN114118162A (en) * | 2021-12-01 | 2022-03-01 | 盐城工学院 | Bearing fault detection method based on improved deep forest algorithm |
CN115033591A (en) * | 2022-06-01 | 2022-09-09 | 广东技术师范大学 | Intelligent detection method and system for electricity charge data abnormity, storage medium and computer equipment |
CN115359393A (en) * | 2022-08-16 | 2022-11-18 | 武汉东智科技股份有限公司 | Image screen-splash abnormity identification method based on weak supervision learning |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116760881A (en) * | 2023-08-17 | 2023-09-15 | 北京智芯微电子科技有限公司 | System configuration method and device of power distribution terminal, storage medium and electronic equipment |
CN116760881B (en) * | 2023-08-17 | 2023-12-22 | 北京智芯微电子科技有限公司 | System configuration method and device of power distribution terminal, storage medium and electronic equipment |
CN117609470A (en) * | 2023-12-08 | 2024-02-27 | 中科南京信息高铁研究院 | Question-answering system based on large language model and knowledge graph, construction method thereof and intelligent data management platform |
CN117786236A (en) * | 2023-12-27 | 2024-03-29 | 中科南京信息高铁研究院 | Cloud edge collaborative reasoning and personality learning method |
CN117786236B (en) * | 2023-12-27 | 2024-08-16 | 中科南京信息高铁研究院 | Cloud edge collaborative reasoning and personality learning method |
Also Published As
Publication number | Publication date |
---|---|
CN116048912B (en) | 2024-07-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||