CN115576778A - Server predictive maintenance model method based on machine learning - Google Patents

Publication number
CN115576778A
Authority
CN
China
Prior art keywords
data
model
server
time
predictive maintenance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211299713.3A
Other languages
Chinese (zh)
Inventor
尹青山
高岩
黄洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-10-24
Filing date
2022-10-24
Publication date
2023-01-06
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202211299713.3A
Publication of CN115576778A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3017 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention relates to the technical field of fault prediction and health management, and in particular to a machine-learning-based server predictive maintenance model method comprising the following steps: collecting historical abnormal data from key server components; processing the collected data; extracting data features; training the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques; and using the LIME algorithm to explain the model's prediction behavior. The beneficial effects are as follows: the method adopts regression and classification algorithms based on support vector machines, and after training on the collected data, the model can accurately predict a server fault at a future point in time from the captured information.

Description

Server predictive maintenance model method based on machine learning
Technical Field
The invention relates to the technical field of fault prediction and health management, in particular to a server predictive maintenance model method based on machine learning.
Background
With the rapid development of the IT industry, emerging technologies such as Internet Plus, cloud computing, big data and blockchain have also advanced quickly. As information technologies such as cloud computing, big data, 5G and edge computing continue to develop, servers are applied ever more widely. Servers are key equipment in a machine room and are widely used to process critical services and information, so server faults and abnormal events can seriously affect service continuity. This places higher demands on server reliability and availability.
In the prior art, high reliability and high availability depend on supporting technologies such as efficient fault diagnosis and fault monitoring. How to monitor server operating faults, and thereby effectively improve the reliability and stability of service operation, has become a research hotspot in recent years. Predictive maintenance is one important application area.
However, before predictive maintenance technology emerged, server equipment in a machine room was generally maintained on a regular schedule, at fixed time intervals, which wastes resources and causes equipment wear.
Disclosure of Invention
The present invention aims to provide a server predictive maintenance model method based on machine learning to solve the problems in the background art.
To achieve this purpose, the invention provides the following technical solution: a machine-learning-based server predictive maintenance model method, comprising the steps of:
collecting historical abnormal data from key server components;
processing the collected data;
extracting data features;
training the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques;
using the LIME algorithm to explain the model's prediction behavior.
Preferably, the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators.
Preferably, statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series.
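The per-column statistics and the pairwise correlation matrix described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation used in the invention. The column names (`cpu_temp`, `fan_rpm`, `error_count`), the sample values and the choice of population variance are assumptions made for the example, and the three ratio features are omitted because the text does not define them.

```python
import math
from statistics import fmean, pvariance


def column_statistics(rows):
    """Return max, min, mean and variance for every column of one file."""
    columns = list(zip(*rows))  # transpose: samples x columns -> columns x samples
    return [
        {"max": max(c), "min": min(c), "mean": fmean(c), "var": pvariance(c)}
        for c in columns
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Pairwise Pearson correlation between all columns of one file."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical file: four samples of (cpu_temp, fan_rpm, error_count).
sample_file = [
    (40.0, 1000.0, 0.0),
    (50.0, 1200.0, 1.0),
    (60.0, 1400.0, 1.0),
    (70.0, 1600.0, 4.0),
]
stats = column_statistics(sample_file)
corr = correlation_matrix(sample_file)
```

Because every file yields the same number of statistics regardless of how many samples it holds, this kind of summary gives a fixed-length feature vector even when the sample count per file varies.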
Preferably, a multi-model fusion method is adopted: the data are vectorized and split into k folds; a model is trained for each fold, giving k sub-models in total; the k sub-models are then used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.
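The fusion procedure can be sketched as follows. This is a hedged illustration only: the text does not say how the k+1 results are combined or exactly which data each sub-model sees, so the sketch assumes that each sub-model is trained on all folds except one and that the k+1 scores are averaged before the threshold h is applied. The one-dimensional centroid "model", the data and h = 0.5 are hypothetical stand-ins for a real learner and real features.

```python
from statistics import fmean


def stratified_folds(labels, k):
    """Assign sample indices to k folds, keeping the class ratio in each fold."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def train_centroid_model(xs, ys):
    """Toy stand-in for a real learner: one centroid per class (1-D feature)."""
    normal = fmean([x for x, y in zip(xs, ys) if y == 0])
    abnormal = fmean([x for x, y in zip(xs, ys) if y == 1])

    def predict(x):
        d_normal, d_abnormal = abs(x - normal), abs(x - abnormal)
        total = d_normal + d_abnormal
        return 0.5 if total == 0 else d_normal / total  # abnormality score in [0, 1]

    return predict


def fused_scores(train_x, train_y, test_x, k):
    """Average the scores of k fold models plus one model trained on all data."""
    models = []
    for held_out in stratified_folds(train_y, k):
        skip = set(held_out)
        xs = [x for i, x in enumerate(train_x) if i not in skip]
        ys = [y for i, y in enumerate(train_y) if i not in skip]
        models.append(train_centroid_model(xs, ys))
    models.append(train_centroid_model(train_x, train_y))  # the (k+1)-th model
    return [fmean([m(x) for m in models]) for x in test_x]


# Hypothetical data: label 1 marks an abnormal sample.
train_x = [1.0, 2.0, 3.0, 4.0, 9.0, 10.0, 11.0, 12.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [2.0, 11.0]

h = 0.5  # assumed threshold
scores = fused_scores(train_x, train_y, test_x, k=2)
labels = [1 if s > h else 0 for s in scores]
```

Averaging is only one possible fusion rule; voting or a weighted combination would fit the same description.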
Preferably, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added: the Macro-F1 score improves by at most 2.3% and the time increases by at most 10.1%. Weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
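For reference, Macro-F1 is the unweighted mean of the per-class F1 scores, so the rare abnormal class counts as much as the common normal class. The sketch below computes it in plain Python; the labels are made-up values for illustration and do not reproduce the figures reported above.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 0 = normal, 1 = abnormal.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
```

Here class 0 has F1 = 0.8 and class 1 has F1 = 2/3, so the macro average is 11/15.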
A machine-learning-based server predictive maintenance model system consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used to collect historical abnormal data from key server components;
the data analysis module is used to process the collected data;
the feature construction module is used to extract data features;
the model fusion module is used to train the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques;
and the analysis module is used to explain the model's prediction behavior using the LIME algorithm.
Preferably, in the data analysis module, the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators.
Preferably, in the feature construction module, statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series.
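The DTW measure used by the feature construction module can be illustrated with the classic dynamic-programming formulation. This is a minimal sketch assuming an absolute-difference local cost and no window constraint; the text does not specify either, and the two series are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical series: the second is the first played back more slowly.
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]
distance = dtw_distance(series_a, series_b)
```

Because DTW may repeat points of either series, the two sequences above align perfectly even though their lengths differ, which suits files that contain a variable number of samples.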
Preferably, in the model fusion module, a multi-model fusion method is adopted: the data are vectorized and split into k folds; a model is trained for each fold, giving k sub-models in total; the k sub-models are then used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.
Preferably, in the model fusion module, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added: the Macro-F1 score improves by at most 2.3% and the time increases by at most 10.1%. Weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
Compared with the prior art, the invention has the beneficial effects that:
The machine-learning-based server predictive maintenance model method provided by the invention adopts regression and classification algorithms based on support vector machines, and after training on the collected data, the model can accurately predict a server fault at a future point in time from the captured information. In the specific development, a client/server (C/S) architecture is used for data acquisition, the LIME algorithm is used to explain the model's prediction behavior, and the model's function is verified by testing the designed model. In addition, as data accumulate over a period of stable operation, the monitoring system's fault prediction accuracy gradually improves.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clear and fully described, embodiments of the present invention are further described in detail below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative of some embodiments of the invention and are not limiting of the invention, and that all other embodiments obtained by those of ordinary skill in the art without the exercise of inventive faculty are within the scope of the invention.
Example one
Referring to fig. 1, the present invention provides a technical solution: a machine learning based server predictive maintenance model method, comprising the steps of:
collecting historical abnormal data from key server components;
processing the collected data; the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators;
extracting data features; statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series;
training the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques; a LightGBM model is adopted first: two important LightGBM parameters, n_estimators and spare_rate, are selected for tuning, and subsample_for_bin and subsample_byte are used to reduce overfitting; an XGBoost model is then used in the tuning process, tuning n_estimators and spare_rate, which strongly influence XGBoost; the optimal values finally obtained are n_estimators = 100 and spare_rate = 0.01;
using the LIME algorithm to explain the model's prediction behavior. Experiments showed that a single model performs poorly, so a multi-model fusion method is adopted: the data are vectorized and split into k folds, and because labels are available, stratified splitting is used so that the training data of every model follow the same distribution, which improves training. A model is trained for each fold, giving k sub-models in total; the k sub-models are used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result. A threshold h is set: a signal above the threshold is considered abnormal and a signal below it is considered normal.
The training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added: the Macro-F1 score improves by at most 2.3% and the time increases by at most 10.1%. Weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
To better understand the predictions made by the black-box model, the LIME algorithm is used to explain its prediction behavior: LIME reveals the contribution of each feature to a prediction result, so that a person can judge whether the decision is reasonable.
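The parameter tuning mentioned in the training step can be pictured as an exhaustive grid search. The sketch below is generic and hedged. It does not call LightGBM or XGBoost. The parameter name `learning_rate` is an assumption about what the translated name refers to, the candidate values are invented, and `toy_evaluate` is a placeholder built to peak at the values reported in the text purely so the example runs. In practice the evaluator would train a model with the given parameters and return a validation score such as Macro-F1.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the highest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Placeholder score (NOT a real experiment); peaks at (100, 0.01)."""
    return (
        -abs(params["n_estimators"] - 100) / 100
        - abs(params["learning_rate"] - 0.01)
    )


# Hypothetical candidate values for the two tuned parameters.
grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.001, 0.01, 0.1],
}
best_params, best_score = grid_search(grid, toy_evaluate)
```

Keeping the evaluator as a plain callable means the same search loop can be reused for either model family by swapping the function that trains and scores the model.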
Example two
A machine-learning-based server predictive maintenance model system consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used to collect historical abnormal data from key server components;
the data analysis module is used to process the collected data; the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators;
the feature construction module is used to extract data features; statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series;
the model fusion module is used to train the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques; a multi-model fusion method is adopted: the data are vectorized and split into k folds; a model is trained for each fold, giving k sub-models in total; the k sub-models are used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal; the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features;
and the analysis module is used to explain the model's prediction behavior using the LIME algorithm.
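The idea behind the analysis module can be sketched as a simplified LIME-style local surrogate: perturb the instance, weight each perturbed sample by its proximity to the instance, and fit a weighted linear model whose coefficients act as per-feature contributions. This is an illustration of the principle only, not the LIME library and not the implementation used in the invention. The Gaussian perturbation, the exponential kernel, every default value and the `black_box` model are assumptions made for the example.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def explain_locally(predict, instance, n_samples=500, scale=1.0,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)
    size = len(instance) + 1  # one extra slot for the intercept
    xtwx = [[0.0] * size for _ in range(size)]
    xtwy = [0.0] * size
    for _ in range(n_samples):
        offset = [rng.gauss(0.0, scale) for _ in instance]
        z = [v + o for v, o in zip(instance, offset)]
        weight = math.exp(-sum(o * o for o in offset) / kernel_width ** 2)
        row = [1.0] + offset  # regress on offsets from the instance
        y = predict(z)
        for i in range(size):
            xtwy[i] += weight * row[i] * y
            for j in range(size):
                xtwx[i][j] += weight * row[i] * row[j]
    coefficients = solve_linear_system(xtwx, xtwy)
    return coefficients[1:]  # local weight of each feature


def black_box(features):
    """Hypothetical fault-probability model over (temperature, fan speed)."""
    temperature, fan_speed = features
    return 1.0 / (1.0 + math.exp(-(2.0 * temperature - 0.5 * fan_speed)))


weights = explain_locally(black_box, [0.0, 0.0])
```

A positive weight means that raising the feature near this instance pushes the prediction toward a fault, and a negative weight means the opposite, which is the kind of per-feature contribution a person can check against domain knowledge.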
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A machine-learning-based server predictive maintenance model method, characterized in that the method comprises the following steps:
collecting historical abnormal data from key server components;
processing the collected data;
extracting data features;
training the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques;
using the LIME algorithm to explain the model's prediction behavior.
2. The machine-learning-based server predictive maintenance model method of claim 1, wherein: the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators.
3. The machine-learning-based server predictive maintenance model method of claim 2, wherein: statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series.
4. The machine-learning-based server predictive maintenance model method of claim 1, wherein: a multi-model fusion method is adopted, in which the data are vectorized and split into k folds; a model is trained for each fold, giving k sub-models in total; the k sub-models are then used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.
5. The machine-learning-based server predictive maintenance model method of claim 1, wherein: the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.
6. A machine-learning-based server predictive maintenance model system according to any one of claims 1-5, characterized in that: the system consists of a data collection module, a data analysis module, a feature construction module, a model fusion module and an analysis module;
the data collection module is used to collect historical abnormal data from key server components;
the data analysis module is used to process the collected data;
the feature construction module is used to extract data features;
the model fusion module is used to train the server fault prediction model urgently needed by current machine rooms from the extracted features, combining machine learning with fault prediction and health management techniques;
and the analysis module is used to explain the model's prediction behavior using the LIME algorithm.
7. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the data analysis module, the collected data are analyzed and preprocessed, and labels are assigned per file, so that each file contains a variable number of samples with no explicit timestamps or workload indicators.
8. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the feature construction module, statistical features are extracted for each column of features in a file, namely the maximum, minimum, mean and variance; three ratio features are constructed using the Pearson correlation function; a pairwise correlation test is then performed on all features to obtain a correlation matrix; and dynamic time warping (DTW) is used to measure the similarity between two time series, computing the similarity of two time variables by stretching and compressing the series.
9. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, a multi-model fusion method is adopted, in which the data are vectorized and split into k folds; a model is trained for each fold, giving k sub-models in total; the k sub-models are then used to predict the test data, giving k prediction results; a (k+1)-th model is then trained directly on all the data and used to predict the test data, giving the (k+1)-th result; a threshold h is set, and a signal above the threshold is considered abnormal while a signal below it is considered normal.
10. The machine-learning-based server predictive maintenance model system of claim 6, wherein: in the model fusion module, the training and prediction time of different feature engineering schemes is measured, and both the Macro-F1 score and the model's time consumption are found to increase as features are added, the Macro-F1 score improving by at most 2.3% and the time increasing by at most 10.1%; weighing the achieved results against the data-processing time, the features used by the model are finally determined to be the statistical features combined with two columns of DTW features.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211299713.3A CN115576778A (en) 2022-10-24 2022-10-24 Server predictive maintenance model method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211299713.3A CN115576778A (en) 2022-10-24 2022-10-24 Server predictive maintenance model method based on machine learning

Publications (1)

Publication Number Publication Date
CN115576778A (en) 2023-01-06

Family

ID=84587881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211299713.3A Pending CN115576778A (en) 2022-10-24 2022-10-24 Server predictive maintenance model method based on machine learning

Country Status (1)

Country Link
CN (1) CN115576778A (en)

Similar Documents

Publication Publication Date Title
CN111459700B (en) Equipment fault diagnosis method, diagnosis device, diagnosis equipment and storage medium
CN113156917B (en) Power grid equipment fault diagnosis method and system based on artificial intelligence
CN109872003B (en) Object state prediction method, object state prediction system, computer device, and storage medium
CN111259947A (en) Power system fault early warning method and system based on multi-mode learning
KR101872342B1 (en) Method and device for intelligent fault diagnosis using improved rtc(real-time contrasts) method
CN110865924B (en) Health degree diagnosis method and health diagnosis framework for internal server of power information system
CN107133632A (en) A kind of wind power equipment fault diagnosis method and system
CN114167838B (en) Multi-scale health assessment and fault prediction method for servo system
CN111949429A (en) Server fault monitoring method and system based on density clustering algorithm
CN111666978B (en) Intelligent fault early warning system for IT system operation and maintenance big data
CN111913443A (en) Industrial equipment fault early warning method based on similarity
He et al. Intelligent Fault Analysis With AIOps Technology
CN113962308A (en) Aviation equipment fault prediction method
CN117041017A (en) Intelligent operation and maintenance management method and system for data center
CN117354171B (en) Platform health condition early warning method and system based on Internet of things platform
CN111314110B (en) Fault early warning method for distributed system
CN115150248A (en) Network flow abnormity detection method and device, electronic equipment and storage medium
CN117421994A (en) Edge application health monitoring method and system
CN115114124A (en) Host risk assessment method and device
CN115576778A (en) Server predictive maintenance model method based on machine learning
CN110956281A (en) Power equipment abnormity detection alarm system based on Log analysis
CN115543671A (en) Data analysis method, device, equipment, storage medium and program product
CN115017238A (en) Data flow detection classification method capable of dynamically predicting
CN114139408A (en) Power transformer health state assessment method
CN114553756B (en) Equipment fault detection method based on joint generation countermeasure network and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination