CN115576778A - Server predictive maintenance model method based on machine learning

Info

Publication number: CN115576778A
Application number: CN202211299713.3A
Authority: CN
Inventors: 尹青山; 高岩; 黄洋
Original assignee: Shandong New Generation Information Industry Technology Research Institute Co Ltd
Current assignee: Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2023-01-06

Abstract

The invention relates to the technical field of fault prediction and health management, and in particular to a server predictive maintenance model method based on machine learning, comprising the following steps: collecting abnormal data of some key components on past servers; processing the collected data; extracting data features; training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques; and using the LIME algorithm to reasonably explain the prediction behavior of the model. The beneficial effects are as follows: the method adopts regression and classification algorithms based on a support vector machine, and after training on the collected data the model can accurately predict, from the captured information, a server fault at a given future point in time.

Description

Server predictive maintenance model method based on machine learning

Technical Field

The invention relates to the technical field of fault prediction and health management, in particular to a server predictive maintenance model method based on machine learning.

Background

With the rapid development of the IT industry, emerging technologies such as Internet Plus, cloud computing, big data and blockchain have also developed rapidly. As information technologies such as cloud computing, big data, 5G and edge computing continue to advance, servers are applied in ever more fields. As key equipment in the machine room, servers are widely used to process critical services and information, and server faults and abnormal events can seriously affect service continuity; higher requirements are therefore placed on server reliability and availability.

In the prior art, high reliability and high availability depend on technical support such as efficient fault diagnosis and fault monitoring. How to monitor server operating faults, and thereby effectively improve the reliability and stability of service operation, has become a research hotspot in recent years. Predictive maintenance is one important application area.

However, before predictive maintenance technology emerged, server equipment in machine rooms was generally maintained on a regular schedule, with periodic maintenance carried out at fixed time intervals, which wastes resources and wears equipment.

Disclosure of Invention

The present invention aims to provide a server predictive maintenance model method based on machine learning to solve the problems in the background art.

To achieve the above purpose, the invention provides the following technical solution: a server predictive maintenance model method based on machine learning, comprising the following steps:

collecting abnormal data of some key components on past servers;

processing the collected data;

extracting data features;

training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques;

using the LIME algorithm to reasonably explain the prediction behavior of the model.

Preferably, the collected data are analyzed and preprocessed, and labels are assigned per file; each file contains a variable number of samples and carries no explicit time nodes or workload indicators.
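The per-file labeling described above can be pictured with a small data structure. The following is only an illustrative sketch: the `LabeledFile` class, the `load_labeled_file` helper, the CSV layout, the signal names and the convention that 1 means abnormal are all assumptions made for the example and are not specified in the text.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class LabeledFile:
    """One collected file: a single label plus a variable number of samples."""
    name: str
    label: int      # assumption: 1 = abnormal, 0 = normal
    columns: dict   # signal name -> list of float samples (no timestamps)


def load_labeled_file(name, csv_text, label):
    """Parse CSV text with a header row into per-signal columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {field: [] for field in reader.fieldnames}
    for row in reader:
        for field, value in row.items():
            columns[field].append(float(value))
    return LabeledFile(name=name, label=label, columns=columns)


# Two hypothetical files with different sample counts.
short = load_labeled_file("a.csv", "cpu_temp,fan_rpm\n40,1000\n42,1100\n", label=0)
long = load_labeled_file(
    "b.csv", "cpu_temp,fan_rpm\n70,3000\n75,3100\n80,3300\n", label=1
)
```

Keeping the label on the file rather than on each sample mirrors the statement that files hold an unfixed number of samples; any per-file feature vector would then be derived from `columns`.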

Preferably, statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series.
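The per-column statistics and the pairwise correlation matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the signal names and values are hypothetical, population variance is assumed because the text does not say which variance is meant, and the three ratio features are omitted because the text does not define them.

```python
import math
from statistics import mean, pvariance


def column_stats(column):
    """Return the four per-column statistics named in the text."""
    return {
        "max": max(column),
        "min": min(column),
        "mean": mean(column),
        "var": pvariance(column),  # assumption: population variance
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations as a nested dict."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical file: each key is one sampled signal (column) of a single file.
file_columns = {
    "cpu_temp": [40.0, 42.0, 44.0, 46.0],
    "fan_rpm": [1000.0, 1100.0, 1200.0, 1300.0],
    "voltage": [12.0, 11.9, 12.1, 12.0],
}
features = {name: column_stats(col) for name, col in file_columns.items()}
corr = correlation_matrix(file_columns)
```

Highly correlated column pairs in `corr` are candidates for pruning, which is one common reason to build such a matrix before training.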

Preferably, a multi-model fusion method is adopted: the data are converted into vectorized form and split into k folds; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.
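The fusion scheme above can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: each sub-model is trained on all folds except one (the usual k-fold reading), the k + 1 scores are combined by a plain average, and a toy nearest-centroid scorer (`train_centroid_model`, hypothetical) stands in for the real learner. The data are made up.

```python
import math


def train_centroid_model(X, y):
    """Stand-in learner: returns a scorer in [0, 1]; higher means more abnormal.

    Assumes the training subset contains both classes (0 and 1).
    """
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def score(x):
        d0, d1 = math.dist(x, c0), math.dist(x, c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

    return score


def fuse_predict(X, y, X_test, k=3, h=0.5):
    """Train k fold models plus one full-data model, average, then threshold."""
    folds = [list(range(i, len(X), k)) for i in range(k)]
    models = []
    for held_out in folds:
        skip = set(held_out)
        idx = [i for i in range(len(X)) if i not in skip]
        models.append(train_centroid_model([X[i] for i in idx], [y[i] for i in idx]))
    models.append(train_centroid_model(X, y))  # the (k + 1)-th model
    results = []
    for x in X_test:
        scores = [m(x) for m in models]
        fused = sum(scores) / len(scores)  # assumption: simple average
        results.append((fused, "abnormal" if fused > h else "normal"))
    return results


# Hypothetical per-file feature vectors and labels (1 = abnormal).
X = [[0.1, 0.2], [0.9, 1.0], [0.2, 0.1], [1.0, 0.9], [0.0, 0.1], [0.8, 0.9]]
y = [0, 1, 0, 1, 0, 1]
predictions = fuse_predict(X, y, [[0.05, 0.1], [0.95, 0.95]], k=3, h=0.5)
```

Raising h trades missed faults for fewer false alarms, so in practice it would be chosen on validation data rather than fixed at 0.5.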

Preferably, the training and prediction time of different feature-engineering schemes is measured; both the Macro-F1 score and the model's time consumption increase as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
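For reference, Macro-F1 is the unweighted mean of the per-class F1 scores. A minimal sketch of the metric follows; the label vectors are invented for illustration and are not results from the patent.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels (1 = abnormal) and predictions.
score = macro_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
```

Because every class counts equally, Macro-F1 does not let a large normal class hide poor performance on the rarer abnormal class, which suits fault data.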

A server predictive maintenance model system based on machine learning consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;

the data collection module is used for collecting abnormal data of some key components on past servers;

the data analysis module is used for processing the collected data;

the feature construction module is used for extracting data features;

the model fusion module is used for training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques;

and the analysis module is used for reasonably explaining the prediction behavior of the model using the LIME algorithm.
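The idea behind the analysis module can be illustrated with a LIME-style local surrogate: perturb one sample, weight the perturbations by proximity, and fit a weighted linear model whose slopes indicate each feature's local contribution. The sketch below is a simplified pure-Python illustration, not the `lime` package and not the patent's implementation; the Gaussian perturbation, the kernel, and the parameters `scale` and `width` are assumptions.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def lime_style_explain(predict, x, n_samples=500, scale=0.1, width=0.25, seed=0):
    """Fit y ~ b0 + sum_j b_j * (z_j - x_j) on weighted perturbations of x."""
    rng = random.Random(seed)
    d = len(x)
    A = [[0.0] * (d + 1) for _ in range(d + 1)]  # weighted normal equations
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in range(d)]
        z = [xi + di for xi, di in zip(x, delta)]
        # Exponential kernel: perturbations closer to x get more weight.
        w = math.exp(-sum(di * di for di in delta) / (width * width))
        row = [1.0] + delta
        y = predict(z)
        for i in range(d + 1):
            b[i] += w * row[i] * y
            for j in range(d + 1):
                A[i][j] += w * row[i] * row[j]
    coef = solve(A, b)
    return coef[0], coef[1:]  # local intercept, per-feature local slopes


def black_box(z):
    """Hypothetical stand-in for the trained fault prediction model."""
    return 0.2 + 3.0 * z[0] - 1.0 * z[1]  # ignores the third feature


intercept, slopes = lime_style_explain(black_box, [0.5, 0.2, 0.9])
```

A near-zero slope, as for the third feature here, suggests that the model's output barely depends on that feature around this sample, which is the kind of check a human reviewer can compare against domain knowledge.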

Preferably, in the data analysis module, the collected data are analyzed and preprocessed, and labels are assigned per file; each file contains a variable number of samples and carries no explicit time nodes or workload indicators.

Preferably, in the feature construction module, statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series.
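The DTW similarity mentioned above can be computed with the classic dynamic-programming recurrence. The following is a minimal sketch that assumes absolute difference as the point-wise cost and no warping-window constraint; the text specifies neither.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances while b[j-1] is reused
                cost[i][j - 1],      # b advances while a[i-1] is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# A series and a time-stretched copy of it align perfectly.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because DTW allows points to be repeated, it compares files with different sample counts directly, which fits the earlier statement that each file holds a variable number of samples.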

Preferably, in the model fusion module, a multi-model fusion method is adopted: the data are converted into vectorized form and split into k folds; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.

Preferably, in the model fusion module, the training and prediction time of different feature-engineering schemes is measured, and it is found that both the Macro-F1 score and the model's time consumption increase as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.

Compared with the prior art, the invention has the beneficial effects that:

The server predictive maintenance model method based on machine learning provided by the invention adopts regression and classification algorithms based on a support vector machine, and after training on the collected data the model can accurately predict, from the captured information, a server fault at a given future point in time. In the concrete implementation, data acquisition uses a client/server (C/S) architecture, the LIME algorithm is used to reasonably explain the prediction behavior of the model, and the model's function is verified by testing the designed model. Moreover, as data continue to accumulate over a period of stable operation, the monitoring system's fault-prediction accuracy gradually improves.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clear and fully described, embodiments of the present invention are further described in detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of some embodiments of the invention and are not limiting of the invention, and that all other embodiments obtained by those of ordinary skill in the art without the exercise of inventive faculty are within the scope of the invention.

Example one

Referring to FIG. 1, the present invention provides a technical solution: a server predictive maintenance model method based on machine learning, comprising the following steps:

collecting abnormal data of some key components on past servers;

processing the collected data; the collected data are analyzed and preprocessed, and labels are assigned per file, each file containing a variable number of samples and carrying no explicit time nodes or workload indicators;

extracting data features; statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series;

training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques; a LightGBM model is adopted first: the two important LightGBM parameters n_estimators and learning_rate are selected for tuning, and subsample_for_bin and subsample_byte are used to reduce the model's degree of overfitting; an XGBoost model is then adopted, and n_estimators and learning_rate, which strongly influence XGBoost, are tuned in the same way, finally giving an optimal n_estimators of 100 and an optimal learning_rate of 0.01;

experiments show that a single model performs poorly, so a multi-model fusion method is adopted: the data are converted into vectorized form and split into k folds; because labels are available, stratified splitting is used so that the training data of every model follow the same distribution, which improves the training effect; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal;

the training and prediction time of different feature-engineering schemes is measured, and it is found that both the Macro-F1 score and the model's time consumption increase as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features;

using the LIME algorithm to reasonably explain the prediction behavior of the model; to better understand the predictions made by the black-box model, the LIME algorithm is used to explain its prediction behavior; LIME reveals the contribution of each feature to a prediction result, so that human knowledge can be used to judge whether the decision is reasonable.
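The stratified split that this embodiment uses to give every model identically distributed training data can be sketched as follows. This is a minimal illustration under assumptions: samples of each label are dealt round-robin into the folds without shuffling, and the label vector is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the label mix."""
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    folds = [[] for _ in range(k)]
    for lab in sorted(by_label):
        # Deal the indices of this label round-robin across the folds.
        for pos, i in enumerate(by_label[lab]):
            folds[pos % k].append(i)
    return folds


# Hypothetical labels: six normal files followed by three abnormal files.
labels = [0] * 6 + [1] * 3
folds = stratified_folds(labels, k=3)
```

With a plain contiguous split of the same labels, the first two folds would contain only normal files, so a model evaluated or trained on them would never see an abnormal example; stratification avoids that.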

Example two

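A server predictive maintenance model system based on machine learning consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;

the data collection module is used for collecting abnormal data of some key components on past servers;

the data analysis module is used for processing the collected data; the collected data are analyzed and preprocessed, and labels are assigned per file, each file containing a variable number of samples and carrying no explicit time nodes or workload indicators;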

the feature construction module is used for extracting data features; statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series;

the model fusion module is used for training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques; a multi-model fusion method is adopted: the data are converted into vectorized form and split into k folds; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal; the training and prediction time of different feature-engineering schemes is measured, and it is found that both the Macro-F1 score and the model's time consumption increase as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features;

and the analysis module is used for reasonably explaining the prediction behavior of the model using the LIME algorithm.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A server predictive maintenance model method based on machine learning, characterized in that the method comprises the following steps:

collecting abnormal data of some key components on past servers;

processing the collected data;

extracting data features;

training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques;

using the LIME algorithm to reasonably explain the prediction behavior of the model.

2. The machine learning-based server predictive maintenance model method of claim 1, wherein: the collected data are analyzed and preprocessed, and labels are assigned per file, each file containing a variable number of samples and carrying no explicit time nodes or workload indicators.

3. The machine learning-based server predictive maintenance model method of claim 2, wherein: statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series.

4. The machine learning-based server predictive maintenance model method of claim 1, wherein: a multi-model fusion method is adopted, the data being converted into vectorized form and split into k folds; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.

5. The machine learning-based server predictive maintenance model method of claim 1, wherein: the training and prediction time of different feature-engineering schemes is measured, both the Macro-F1 score and the model's time consumption increasing as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; and, after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.

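6. A machine learning-based server predictive maintenance model system according to any one of the preceding claims 1-5, characterized in that: the system consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;

the data collection module is used for collecting abnormal data of some key components on past servers;

the data analysis module is used for processing the collected data;

the feature construction module is used for extracting data features;

the model fusion module is used for training the server fault prediction model urgently needed in current machine rooms from the extracted data features, combining machine learning with fault prediction and health management techniques;

and the analysis module is used for reasonably explaining the prediction behavior of the model using the LIME algorithm.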

7. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the data analysis module, the collected data are analyzed and preprocessed, and labels are assigned per file, each file containing a variable number of samples and carrying no explicit time nodes or workload indicators.

8. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the feature construction module, statistical features are extracted as the statistics of each feature column in a file, namely the maximum, minimum, mean and variance; for the ratio features, three ratio features are constructed using the Pearson correlation function; all features are then tested for correlation pairwise to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, the similarity feature between two time variables being calculated by stretching and compressing the time series.

9. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, a multi-model fusion method is adopted, the data being converted into vectorized form and split into k folds; for the k-fold data, one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the data to be predicted, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving the (k + 1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.

10. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, the training and prediction time of different feature-engineering schemes is measured, and it is found that both the Macro-F1 score and the model's time consumption increase as features are added, the Macro-F1 score rising by at most 2.3% and the time by at most 10.1%; and, after weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.