CN115576778A - Server predictive maintenance model method based on machine learning - Google Patents
- Publication number
- CN115576778A (application number CN202211299713.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- model
- server
- time
- predictive maintenance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3017—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3447—Performance evaluation by modeling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention relates to the technical field of fault prediction and health management, and in particular to a server predictive maintenance model method based on machine learning, which comprises the following steps: collecting historical abnormal data from key server components; processing the collected data; extracting data features; training the server fault prediction model that the machine room currently needs from the extracted features, combining machine learning with fault prediction and health management techniques; and using the LIME algorithm to explain the model's prediction behavior. The beneficial effects are as follows: the method adopts regression and classification algorithms based on a support vector machine, and after training on the collected data, the model can accurately predict a server fault at a future point in time from the captured information.
Description
Technical Field
The invention relates to the technical field of fault prediction and health management, in particular to a server predictive maintenance model method based on machine learning.
Background
With the rapid development of the IT industry, emerging technologies such as Internet Plus, cloud computing, big data, blockchain, 5G and edge computing have also advanced quickly, and servers are used in an ever wider range of applications. As key equipment in a machine room, servers process critical services and information, so server faults and abnormal events can seriously affect service continuity. This places higher requirements on server reliability and availability.
High reliability and high availability depend on supporting technologies such as efficient fault diagnosis and fault monitoring. How to monitor server operating faults, and thereby improve the reliability and stability of service operation, has become a research hotspot in recent years. Predictive maintenance is one important application area.
Before predictive maintenance technology emerged, however, server equipment in a machine room was generally maintained on a fixed schedule, with periodic maintenance carried out at set time intervals, which wastes resources and wears out equipment.
Disclosure of Invention
The present invention aims to provide a server predictive maintenance model method based on machine learning to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a machine learning based server predictive maintenance model method, comprising the steps of:
collecting historical abnormal data from key server components;
processing the collected data;
extracting data features;
training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques;
using the LIME algorithm to explain the model's prediction behavior.
Preferably, the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators.
Preferably, statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series.
Preferably, a multi-model fusion method is adopted: the data is processed into vectorized data and split into k folds; one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving a (k + 1)-th result; a threshold h is set, above which a signal is considered abnormal and below which it is considered normal.
Preferably, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
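As an illustration of the statistical and correlation features described above, the following minimal Python sketch computes the maximum, minimum, mean and variance of each column in one file, together with a pairwise Pearson correlation matrix. The function names, the sample data, the use of population variance and the convention of reporting 0 for a constant column are assumptions made for this example; the text above does not specify them.

```python
import math
from statistics import mean, pvariance


def column_stats(rows):
    """Return [max, min, mean, variance] for every column of one file."""
    features = []
    for col in zip(*rows):  # transpose so that we iterate over columns
        # Population variance is an assumption; the source only says "variance".
        features.extend([max(col), min(col), mean(col), pvariance(col)])
    return features


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation is undefined, 0 by convention here
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Pairwise Pearson correlation between all columns of one file."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]


# Hypothetical file: 4 samples, 2 metric columns.
sample_file = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
stats = column_stats(sample_file)
corr = correlation_matrix(sample_file)
```

Because every file yields a fixed-length feature vector regardless of how many samples it contains, this kind of aggregation fits files with a variable number of samples.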
A server predictive maintenance model system based on machine learning is composed of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used for collecting historical abnormal data from key server components;
the data analysis module is used for processing the collected data;
the feature construction module is used for extracting data features;
the model fusion module is used for training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques;
and the analysis module is used for explaining the model's prediction behavior with the LIME algorithm.
Preferably, in the data analysis module, the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators.
Preferably, in the feature construction module, statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series.
Preferably, in the model fusion module, a multi-model fusion method is adopted: the data is processed into vectorized data and split into k folds; one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving a (k + 1)-th result; a threshold h is set, above which a signal is considered abnormal and below which it is considered normal.
Preferably, in the model fusion module, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
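The DTW similarity measure mentioned above can be sketched with the textbook dynamic-programming formulation. This is a minimal illustration and not necessarily the variant used by the inventors: the function name, the absolute-difference local cost and the absence of a warping window are all assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # A point may be matched after a step in a, a step in b, or both,
            # which is what lets the series stretch or compress in time.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated value can be absorbed by warping, whereas a point-by-point comparison could not even be applied to series of different lengths.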
Compared with the prior art, the invention has the beneficial effects that:
The server predictive maintenance model method based on machine learning provided by the invention adopts regression and classification algorithms based on a support vector machine; after training on the collected data, the model can accurately predict a server fault at a future point in time from the captured information. In the concrete implementation, a client/server (C/S) architecture is used for data collection, the LIME algorithm is used to explain the model's prediction behavior, and the model's function is verified by testing the designed model. In addition, after a period of stable operation, data continues to accumulate and the monitoring system's fault prediction accuracy gradually improves.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clear and fully described, embodiments of the present invention are further described in detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of some embodiments of the invention and are not limiting of the invention, and that all other embodiments obtained by those of ordinary skill in the art without the exercise of inventive faculty are within the scope of the invention.
Example one
Referring to FIG. 1, the present invention provides a technical solution: a machine learning based server predictive maintenance model method, comprising the following steps:
collecting historical abnormal data from key server components;
processing the collected data; the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators;
extracting data features; statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series;
training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques; first, a LightGBM model is adopted, and two important LightGBM parameters, n_estimators and spare_rate, are tuned, with subsample_for_bin and subsample_byte used to reduce overfitting; then an XGBoost model is adopted, and the parameters n_estimators and spare_rate, which strongly influence XGBoost, are tuned; the optimal values finally obtained are n_estimators = 100 and spare_rate = 0.01;
using the LIME algorithm to explain the model's prediction behavior. Experiments show that a single model performs poorly, so a multi-model fusion method is adopted: the data is processed into vectorized data and split into k folds; because labels are available, stratified splitting is used so that the training data of every model follows the same distribution, which improves training. One model is trained per fold, giving k sub-models in total; the k sub-models predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and predicts the test data, giving a (k + 1)-th result. A threshold h is set, above which a signal is considered abnormal and below which it is considered normal. The training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features. To better understand the predictions made by the black-box model, the LIME algorithm is used to explain its prediction behavior: LIME reveals the contribution of each feature to a prediction result, so that a person can judge whether the prediction is reasonable.
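The multi-model fusion step described above can be sketched as follows. This is a hedged illustration, not the inventors' implementation: a toy nearest-centroid scorer stands in for the LightGBM and XGBoost models, each sub-model is assumed to be trained on the data with one stratified fold held out, and the k + 1 scores are assumed to be combined by simple averaging before the threshold h is applied. All function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the label mix."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds


def fit_centroid_model(X, y):
    """Toy stand-in model: score in [0, 1], higher means closer to 'abnormal'.

    Assumes both classes (0 = normal, 1 = abnormal) are present in y.
    """
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(c) / len(rows) for c in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    def predict(x):
        d0, d1 = dist(x, c0), dist(x, c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

    return predict


def fused_predict(X, y, X_test, k=3, h=0.5, fit=fit_centroid_model):
    """Train k fold models plus one full-data model, average, threshold at h."""
    models = []
    for held_out in stratified_folds(y, k):
        skip = set(held_out)
        keep = [i for i in range(len(X)) if i not in skip]
        models.append(fit([X[i] for i in keep], [y[i] for i in keep]))
    models.append(fit(X, y))  # the (k + 1)-th model uses all data
    scores = [sum(m(x) for m in models) / len(models) for x in X_test]
    return [1 if s > h else 0 for s in scores]  # above h: abnormal
```

With the hypothetical data `X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]` and `y = [0, 0, 0, 1, 1, 1]`, calling `fused_predict(X, y, [[0.05], [1.15]])` labels the first test point normal and the second abnormal. Raising h trades recall on abnormal samples for fewer false alarms.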
Example two
A server predictive maintenance model system based on machine learning is composed of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used for collecting historical abnormal data from key server components;
the data analysis module is used for processing the collected data; the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators;
the feature construction module is used for extracting data features; statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series;
the model fusion module is used for training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques; a multi-model fusion method is adopted: the data is processed into vectorized data and split into k folds; one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving a (k + 1)-th result; a threshold h is set, above which a signal is considered abnormal and below which it is considered normal; the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features;
and the analysis module is used for explaining the model's prediction behavior with the LIME algorithm.
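The Macro-F1 score used above to compare feature sets is the unweighted mean of the per-class F1 scores, so the rare abnormal class counts as much as the normal class. The sketch below shows one way to compute it; the function name and the convention of scoring 0 for a class with no true or predicted samples are assumptions for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); 0 by convention when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Measuring this score alongside wall-clock training and prediction time for each candidate feature set gives the two quantities that the text weighs against each other when settling on the final features.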
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A server predictive maintenance model method based on machine learning, characterized in that the method comprises the following steps:
collecting historical abnormal data from key server components;
processing the collected data;
extracting data features;
training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques;
using the LIME algorithm to explain the model's prediction behavior.
2. The machine learning-based server predictive maintenance model method of claim 1, wherein: the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators.
3. The machine learning-based server predictive maintenance model method of claim 2, wherein: statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series.
4. The machine learning-based server predictive maintenance model method of claim 1, wherein: a multi-model fusion method is adopted, in which the data is processed into vectorized data and split into k folds; one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving a (k + 1)-th result; a threshold h is set, above which a signal is considered abnormal and below which it is considered normal.
5. The machine learning-based server predictive maintenance model method of claim 1, wherein: the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
6. A machine learning based server predictive maintenance model system according to any of the preceding claims 1-5, characterized in that: the system consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used for collecting historical abnormal data from key server components;
the data analysis module is used for processing the collected data;
the feature construction module is used for extracting data features;
the model fusion module is used for training the server fault prediction model that the machine room currently needs from the extracted data features, combining machine learning with fault prediction and health management techniques;
and the analysis module is used for explaining the model's prediction behavior with the LIME algorithm.
7. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the data analysis module, the collected data is analyzed and preprocessed, and labels are assigned with a file as the unit; each file contains a variable number of samples, with no explicit time nodes or workload indicators.
8. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the feature construction module, statistical features are extracted for each column of features in a file, including the maximum, minimum, mean and variance; three ratio features are constructed with the aid of the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, where the similarity between the two time variables is computed by stretching and compressing the time series.
9. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, a multi-model fusion method is adopted, in which the data is processed into vectorized data and split into k folds; one model is trained per fold, giving k sub-models in total, and the k sub-models are used to predict the test data, giving k prediction results; a (k + 1)-th model is then trained directly on all the data and used to predict the test data, giving a (k + 1)-th result; a threshold h is set, above which a signal is considered abnormal and below which it is considered normal.
10. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, with the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; by weighing the achieved result against the data processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211299713.3A CN115576778A (en) | 2022-10-24 | 2022-10-24 | Server predictive maintenance model method based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211299713.3A CN115576778A (en) | 2022-10-24 | 2022-10-24 | Server predictive maintenance model method based on machine learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115576778A | 2023-01-06 |
Family
ID=84587881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211299713.3A (Pending; CN115576778A) | Server predictive maintenance model method based on machine learning | 2022-10-24 | 2022-10-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115576778A (en) |
-
2022
- 2022-10-24 CN CN202211299713.3A patent/CN115576778A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||