CN111027591B - Node fault prediction method for large-scale cluster system - Google Patents


Info

Publication number
CN111027591B
Authority
CN
China
Prior art keywords
data
window
fault
prediction
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911107846.4A
Other languages
Chinese (zh)
Other versions
CN111027591A (en)
Inventor
伍卫国
毛海
聂世强
张驰
董小社
张兴军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201911107846.4A
Publication of CN111027591A
Application granted
Publication of CN111027591B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a node fault prediction method for a large-scale cluster system. Resource occupancy data are collected from each node to generate a data set; a first data prediction model is built with a long short-term memory (LSTM) network, and a second fault prediction model is built with a random forest. A first observation window is established and its size is checked; if the size does not meet the set value, the window is rebuilt. If the set value is met, the first data prediction model predicts the data in the lead time window, and the first observation window is combined with the data in the lead time window to form a second observation window. The size of the second observation window is then checked; if it does not meet the set value, the second observation window is rebuilt; if it does, the second fault prediction model predicts the fault within the prediction window. The invention makes the prediction model as accurate as possible while ensuring that sufficient lead time remains for handling the node fault.

Description

Node fault prediction method for large-scale cluster system
Technical Field
The invention belongs to the technical field of reliability and availability of computer systems, and particularly relates to a node fault prediction method for a large-scale cluster system.
Background
Cluster systems are common platforms for high-performance computing, cloud computing, and data centers. As these platforms grow in size and complexity, system reliability becomes a major concern, because the mean time between failures (MTBF) of a system decreases as the number of its components increases. Recent research indicates that the MTBF of existing data centers and cloud computing systems is limited to 10-100 hours. Data centers typically have a high failure rate because they contain many servers and components. Furthermore, long-running applications and intensive workloads are common in these facilities. System performance depends on machine availability, which suffers if failures are not handled well.
To meet the increasing demand for cloud computing, internet companies such as Google, Facebook, and Amazon typically deploy a large number of servers in their data centers. These servers are heavily loaded and handle a wide variety of requests. In such a high-availability computing environment, when one server in a cluster fails, its workload is typically shifted to another machine in the same cluster, which increases the likelihood that other servers fail as well.
Server failures can cause data loss and resource blocking because a machine suddenly becomes unavailable. In the worst case, these failures can crash the data center and cause unexpected downtime, after which restoring the data is very costly. According to the data center outage report published by the Ponemon Institute in 2016, restoring data costs $9,000 per minute on average and up to $17,000 per minute. Among all server nodes in the Microsoft cloud system, fewer than 0.1% fail each day, yet this has a significant impact on services that target 99.999% or higher availability. Node failure is therefore one of the major causes of service outages.
Online fault prediction predicts faults by analyzing the historical fault data of a machine together with the current state of the system, so that the adverse effects of faults on a cluster can be avoided or reduced; it is an important means of improving the reliability and availability of a storage system. Although predicting the next failure of a machine appears to be a viable and promising way to improve data center reliability, it presents two major challenges. The first is that the prediction must be highly accurate, and in particular must keep false positives low. The second is how to choose a suitable lead time. If the lead time is too long, the significant features that appear shortly before the failure cannot be fully used, so model accuracy is low; if the lead time is too short, prediction accuracy improves, but the administrator does not have enough time to act on the node and avoid the failure.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, a node fault prediction method for a large-scale cluster system that makes the prediction model as accurate as possible while ensuring that sufficient lead time is available for handling node faults.
The invention adopts the following technical scheme:
A node fault prediction method for a large-scale cluster system collects resource occupancy data of each node and generates a data set, builds a first data prediction model with a long short-term memory (LSTM) network, builds a second fault prediction model with a random forest, establishes a first observation window, and checks the size of the first observation window; if the size does not meet the set value, the first observation window is rebuilt. If the set value is met, the first data prediction model predicts the data in the lead time window, the first observation window and the data in the lead time window are combined to form a second observation window, and the size of the second observation window is checked; if the set value is not met, the second observation window is rebuilt; if it is met, the second fault prediction model predicts the fault within the prediction window.
Specifically, each node collects actual operating parameters; n unit time windows form an observation window, from which the data set is generated, and the data of the node within the observation window are used to predict each item of its data within the lead time window.
Further, the nodes collect actual operating parameters every 5 minutes.
Further, each item of predicted data $Y_{r,\tau}$ in time period $\tau$ is:
$Y_{r,\tau} = f(P(t))$
where f denotes the model to be solved, P(t) is the vector formed by all data, $t \in (1, \tau-1)$, and $r \in \text{resources}$.
Specifically, the input of the long short-term memory network comprises the number of training samples, the time step, and the feature values, and the feature values are represented by the vector P(t) formed by all data.
Furthermore, the correlation between each feature value and faults is obtained by calculating Pearson correlation coefficients, and the 9 feature values whose correlation coefficients are larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values.
Further, the feature values are: mean CPU usage rate, canonical memory usage, total page cache memory usage, maximum memory usage, mean disk I/O time, mean local disk space used, maximum CPU usage, maximum disk IO time, and memory accesses per instruction.
Specifically, the input of the random forest is the vector $P(t)$ of feature values within the first observation window and the vector $Y(t_1)$ of feature values within the lead time window; the prediction yields whether a fault occurs within the prediction window, and y, which indicates whether a fault occurs within the prediction window, is:
$y = f(P(t), Y(t_1))$
where f represents the model to be solved, 1 represents a fault, and 0 represents no fault.
Compared with the prior art, the invention at least has the following beneficial effects:
The node fault prediction method for a large-scale cluster system can accurately predict how the resource occupancy of a node will change over a future period. A random forest then predicts node faults from the predicted and the real resource occupancy data. Because node fault prediction only needs to predict the machine state in the next time period, it is a binary classification problem, and random forests achieve high accuracy among classification algorithms. Random forests are not prone to overfitting, can handle high-dimensional data without feature selection, and adapt well to different data sets.
Furthermore, the first-stage data prediction predicts the resource occupancy data of the nodes within the lead time window, which overcomes the lack of data in the lead time window in traditional fault prediction methods; the second-stage node fault prediction can then make full use of the data in the lead time window, which improves prediction accuracy.
Furthermore, a node exposes many indicators related to resource occupancy, and different feature values influence the fault prediction algorithm differently. The correlation between each feature value and faults is obtained by calculating Pearson correlation coefficients, which determines the feature values needed for prediction and keeps useless feature values from affecting fault prediction.
In summary, the present invention uses the strength of LSTM in handling data that are strongly correlated with their time series, and dependencies that are far apart in the series, to predict the data in the lead time window effectively. These predicted data are then combined with the real data to form the data in an observation window, and a random forest performs the final fault prediction. Lead time is reserved for responding to faults, and the data in the lead time window are also fully used, which preserves the accuracy of the model.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a time window definition diagram;
FIG. 2 is a diagram of a new time window definition;
FIG. 3 is a view of the internal structure of the LSTM;
FIG. 4 is a prediction flow chart according to the present invention.
Detailed Description
Referring to Fig. 4, the node fault prediction method for a large-scale cluster system of the present invention collects the resource occupancy data of each node and processes them to generate a data set, builds a first data prediction model with a long short-term memory (LSTM) network, builds a second fault prediction model with a random forest, establishes the first observation window data, and checks whether the size of the first observation window equals 3 hours; if not, the first observation window is rebuilt. If it does, the first data prediction model predicts the data in the lead time window, the first observation window is combined with the data in the lead time window to form a second observation window, and the method checks whether the size of the second observation window equals 4 hours; if not, the second observation window is rebuilt; if it does, the second fault prediction model predicts the fault within the prediction window.
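The control flow of Fig. 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `predict_node_fault` and the two callables are hypothetical stand-ins for the LSTM and random forest stages, and the 1-hour lead time is inferred from the difference between the 4-hour and 3-hour window sizes.

```python
from typing import Callable, List, Optional, Sequence

STEP_MINUTES = 5                                # sampling period given in the description
FIRST_WINDOW_STEPS = 3 * 60 // STEP_MINUTES     # 3 hours -> 36 samples
SECOND_WINDOW_STEPS = 4 * 60 // STEP_MINUTES    # 4 hours -> 48 samples

Sample = Sequence[float]  # one feature vector per 5-minute period


def predict_node_fault(
    history: List[Sample],
    forecast_lead_window: Callable[[List[Sample]], List[Sample]],
    classify_fault: Callable[[List[Sample]], int],
) -> Optional[int]:
    """Return 1 (fault), 0 (no fault), or None when a window must be rebuilt."""
    first_window = list(history[-FIRST_WINDOW_STEPS:])
    if len(first_window) != FIRST_WINDOW_STEPS:
        return None  # first observation window is not 3 hours long

    # Stage 1 (the LSTM in the description): forecast the lead time window.
    lead_window = list(forecast_lead_window(first_window))

    second_window = first_window + lead_window
    if len(second_window) != SECOND_WINDOW_STEPS:
        return None  # second observation window is not 4 hours long

    # Stage 2 (the random forest in the description): classify the window.
    return classify_fault(second_window)
```

Any forecaster and classifier with these call shapes can be plugged in, which keeps the window checks separate from the two models.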
S1. Predict the node resource occupancy based on a long short-term memory (LSTM) network.
In fault prediction, the data within an observation window are typically used to predict whether a fault will occur within a prediction window. The closer the data are to the moment of failure, the more obvious the signs of the fault, i.e. the more important the feature values. However, to give the administrator enough time to deal with the fault, a lead time must be reserved (the lead time window in Fig. 1); as a result, the data in the lead time window cannot be used when making a prediction, which lowers prediction accuracy. To exploit the data in the lead time window and preserve prediction accuracy, an LSTM-based node resource occupancy prediction method is proposed: the LSTM predicts the data in the lead time window, which enlarges the observation window. The resulting new time windows are shown in Fig. 2.
LSTM (long short-term memory) is a special kind of RNN (recurrent neural network) that can learn long-term dependencies; it is explicitly designed to avoid the long-term dependency problem. LSTM is well suited to data that are strongly correlated with their time series, and also to dependencies between points that are far apart in the series.
The data set is generated from the actual operating parameters that each node collects every 5 minutes, with n unit time windows forming one observation window.
When the feature values are selected, the correlation between each feature value and faults is obtained by calculating Pearson correlation coefficients, and the 9 feature values whose correlation coefficients are larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values. The selected feature values are:
mean CPU usage rate, canonical memory usage, total page cache memory usage, maximum memory usage, mean disk I/O time, mean local disk space used, maximum CPU usage, maximum disk IO time, memory accesses per instruction.
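The selection rule can be illustrated with a short sketch. The helper names `pearson` and `select_features` are hypothetical, the comparison uses the signed coefficient because the text says "larger than 0.1", and treating a constant column as having zero correlation is an assumption made here to avoid division by zero.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def select_features(
    columns: Dict[str, Sequence[float]],
    fault_labels: Sequence[float],
    threshold: float = 0.1,
) -> List[str]:
    """Keep the metrics whose correlation with the fault label exceeds the threshold."""
    return [
        name
        for name, values in columns.items()
        if pearson(values, fault_labels) > threshold
    ]
```

For example, a metric that rises exactly when the fault label is 1 has a coefficient of 1.0 and is kept, while a metric that never changes is dropped.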
In the time periods from 1 to $\tau-1$, every feature value is normalized by its respective maximum so that it ranges from 0 to 1, and the vector formed by the normalized feature value data is denoted P(t):
$P(t) = U_{r,t}, \quad t \in (1, \tau-1), \quad r \in \text{resources}$
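A minimal sketch of this per-feature maximum normalization follows. The function name `normalize_by_max` is hypothetical, the measurements are assumed to be non-negative (as resource usage figures are), and mapping an all-zero column to 0.0 is an assumption for the degenerate case.

```python
from typing import List, Sequence


def normalize_by_max(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Divide every feature column by its own maximum so values fall in [0, 1].

    `series` holds one row per time period and one column per resource metric.
    """
    n_features = len(series[0])
    maxima = [max(row[j] for row in series) for j in range(n_features)]
    return [
        [row[j] / maxima[j] if maxima[j] > 0 else 0.0 for j in range(n_features)]
        for row in series
    ]
```

Each row of the result corresponds to one P(t), with one entry per resource r.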
The LSTM input layer includes the number of training samples (samples), the time step (time steps), and the feature values (features). The time step is the number of preceding time periods of input data that each data point is related to. The feature values are represented by the vector P(t).
The measured values are normalized by their respective maximum values so that they range from 0 to 1. With f denoting the model to be solved and $Y_{r,\tau}$ denoting each item of predicted data within time period $\tau$, the prediction is expressed as:
$Y_{r,\tau} = f(P(t))$
where $t \in (1, \tau-1)$ and $r \in \text{resources}$.
The data of a node within the observation window are used to predict each item of that node's data within the lead time window.
The internal structure of the LSTM is shown in Fig. 3. A gate passes information selectively and is implemented mainly by a sigmoid neural network layer followed by a pointwise multiplication. The LSTM contains three multiplication operations because it has three gates: a forget gate, an input gate, and an output gate.
The forget gate decides which information to discard from the cell state.
The input gate determines which update information is stored in the cell state. This process requires the following steps:
First, a sigmoid layer decides which information needs to be updated, and a tanh layer generates a vector of candidate values in the range (-1, 1); these two parts together form the input gate, and the two vectors are then combined to create the update value.
The old state and the update value are then combined to obtain the new cell state. The output gate determines what is output. Based on the cell state, a sigmoid layer is first run to decide which part of the cell state to output;
finally, the cell state is passed through tanh, which maps its values to between -1 and 1, and is multiplied by the output of the sigmoid gate, so that only the selected part is output.
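The gate operations described above correspond to the standard textbook LSTM cell, which can be sketched in plain Python as below. This is an illustration under that assumption rather than the patent's own code; the parameter names (`Wf`, `bf`, and so on) and the function `lstm_step` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights: List[Vector], v: Vector, bias: Vector) -> Vector:
    """Compute weights @ v + bias for small dense lists."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def lstm_step(
    x: Vector, h_prev: Vector, c_prev: Vector, params: Dict[str, list]
) -> Tuple[Vector, Vector]:
    """One time step of a standard LSTM cell; returns (hidden state, cell state)."""
    z = h_prev + x  # concatenation of previous hidden state and current input
    forget = [sigmoid(a) for a in affine(params["Wf"], z, params["bf"])]      # what to discard
    update = [sigmoid(a) for a in affine(params["Wi"], z, params["bi"])]      # what to update
    candidate = [math.tanh(a) for a in affine(params["Wc"], z, params["bc"])]  # values in (-1, 1)
    output = [sigmoid(a) for a in affine(params["Wo"], z, params["bo"])]      # what to expose

    # Three pointwise multiplications, one per gate.
    c = [f * cp + i * g for f, cp, i, g in zip(forget, c_prev, update, candidate)]
    h = [o * math.tanh(ct) for o, ct in zip(output, c)]
    return h, c
```

With all weights and biases at zero, every sigmoid gate outputs 0.5 and the candidate is 0, so the cell state is simply halved at each step.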
The LSTM parameters are set as follows:
The time step is set to 36 (3 hours of 5-minute periods), i.e. each data point is associated with the data of the previous 36 periods.
The number of feature values is set to 9.
Activation, i.e. the activation function, is set to 'relu'.
Dropout is set to 0.2.
batch_size is set to 196.
The number of hidden layer nodes is set to 5.
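The time step of 36 follows from 3 hours × 60 minutes / 5 minutes per sample. A sketch of how a normalized series could be cut into the (samples, time steps, features) layout is given below; the function name `make_training_windows` is hypothetical, and using the sample that immediately follows each window as its target is an assumption, since the text does not spell out how targets are formed.

```python
from typing import List, Sequence, Tuple

TIME_STEPS = 3 * 60 // 5   # 36 five-minute periods in a 3-hour window
N_FEATURES = 9             # number of selected feature values

Row = Sequence[float]


def make_training_windows(
    series: Sequence[Row], time_steps: int = TIME_STEPS
) -> Tuple[List[List[Row]], List[Row]]:
    """Slide a window over the series.

    Returns inputs shaped (samples, time_steps, features) and, as an assumed
    target, the row that immediately follows each window.
    """
    inputs: List[List[Row]] = []
    targets: List[Row] = []
    for end in range(time_steps, len(series)):
        inputs.append(list(series[end - time_steps:end]))
        targets.append(series[end])
    return inputs, targets
```

A series of 40 periods therefore yields 4 training samples, each with 36 time steps of 9 features.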
S2. Predict faults based on the random forest.
The fault prediction algorithm plays an important role in prediction accuracy. Node fault prediction generally uses supervised machine learning: because the resource occupancy and machine state of a node differ between time periods, an unsupervised method cannot establish the correspondence between resource occupancy and machine state well, which lowers the accuracy of the model.
The random forest is a supervised learning algorithm and an ensemble learning algorithm that uses decision trees as base learners. Random forests are not prone to overfitting, can handle high-dimensional data without feature selection, and adapt well to different data sets. Among classification algorithms, random forests achieve high accuracy. A random forest algorithm is therefore adopted for fault prediction.
The fault prediction based on the random forest specifically comprises the following steps:
After the first stage, the node resource occupancy data in the lead time window have been predicted. They are combined with the first observation window that precedes the lead time window (the observation window in Fig. 1) to form a second observation window (the observation window in Fig. 2), and a random forest predicts whether a fault will occur in a future period of time (i.e., the prediction window).
The input of the random forest is the vector $P(t)$ of feature values within the first observation window and the vector $Y(t_1)$ of feature values within the lead time window. With f denoting the model to be solved and y indicating whether a fault occurs within the prediction window, the prediction is expressed as:
$y = f(P(t), Y(t_1))$
The prediction yields whether a fault occurs within the prediction window, where 1 represents a fault and 0 represents no fault.
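One way to present P(t) and Y(t_1) to a classifier is to concatenate the two windows and flatten them into a single feature row, as sketched below. The flattening step and the name `build_forest_input` are assumptions made for illustration; the text only states that both vectors are inputs. The 12-step lead window is inferred from the 1-hour difference between the 4-hour and 3-hour windows.

```python
from typing import List, Sequence

Row = Sequence[float]


def build_forest_input(first_window: Sequence[Row], lead_window: Sequence[Row]) -> List[float]:
    """Join the real samples P(t) with the predicted samples Y(t_1) and flatten them.

    With 36 real and 12 predicted samples of 9 features each, the result has
    (36 + 12) * 9 = 432 values.
    """
    second_window = list(first_window) + list(lead_window)
    return [value for sample in second_window for value in sample]
```

The real data come first and the predicted data last, so the values closest to the prediction window sit at the end of the row.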
The random forest parameters were set as follows:
n_estimators is set to 20.
max_depth is set to 50.
min_samples_leaf is set to 20.
min_samples_split is set to 30.
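These parameter names match the keyword arguments of scikit-learn's `RandomForestClassifier`, although the text does not name the library that was used. Under that assumption, the settings could be collected as follows; `RANDOM_FOREST_PARAMS` and `build_random_forest` are hypothetical names.

```python
RANDOM_FOREST_PARAMS = {
    "n_estimators": 20,       # number of decision trees in the forest
    "max_depth": 50,          # maximum depth of each tree
    "min_samples_leaf": 20,   # minimum samples required at a leaf node
    "min_samples_split": 30,  # minimum samples required to split a node
}


def build_random_forest(params=None):
    """Create a classifier with the listed settings (requires scikit-learn)."""
    # Imported lazily so the parameter table is usable without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(**(params or RANDOM_FOREST_PARAMS))
```

The returned estimator would then be fitted on rows built from the second observation window, with labels 1 for fault and 0 for no fault.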
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The LSTM-based node resource occupancy prediction method is shown in Table 1:
Table 1. Node resource occupancy prediction method based on LSTM
[Table 1 appears only as images in the source: Figure BDA0002271851230000091, Figure BDA0002271851230000101]
The fault prediction method based on random forests is shown in Table 2.
Table 2. Fault prediction method based on random forest
[Table 2 appears only as images in the source: Figure BDA0002271851230000102, Figure BDA0002271851230000111]
At present, research on node fault prediction in China and abroad does not fully consider the lead time window, and a lead time is generally not set, so the administrator does not have sufficient time to take measures to avoid faults. In addition, the data shortly before a fault show obvious signs of it, yet even when a lead time is set, the data in the lead time window cannot be used, which reduces prediction accuracy.
The invention uses the strength of LSTM in handling data that are strongly correlated with their time series, and dependencies that are far apart in the series, to predict the data in the lead time window effectively. These predicted data are then combined with the real data to form the data in an observation window, and a random forest performs the final fault prediction. Lead time is reserved, and the data in the lead time window are also fully used, which preserves the accuracy of the model.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (7)

1. A node fault prediction method for a large-scale cluster system, characterized in that resource occupancy data of all nodes are collected and a data set is generated, a first data prediction model is built with a long short-term memory (LSTM) network, a second fault prediction model is built with a random forest, a first observation window is established, the size of the first observation window is checked, and if the size does not meet the set value, the first observation window is rebuilt; if the set value is met, the first data prediction model predicts the data in the lead time window, the first observation window and the data in the lead time window are combined to form a second observation window, the size of the second observation window is checked, and if the set value is not met, the second observation window is rebuilt; if it is met, the second fault prediction model predicts the fault within the prediction window;
when fault prediction is performed, the data in the second observation window are used to predict whether a fault occurs within the prediction window; a period of time before the fault occurrence time is reserved as the lead time window, the first observation window precedes the lead time window, the prediction window follows the lead time window, and a random forest predicts whether a fault will occur within the prediction window.
2. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein each node collects actual operating parameters, n unit time windows form an observation window from which a data set is generated, and the data of the node within the observation window are used to predict each item of its data within the lead time window.
3. The large-scale cluster system-oriented node fault prediction method according to claim 2, wherein the nodes collect actual operating parameters every 5 minutes.
4. The large-scale cluster system-oriented node fault prediction method of claim 2, wherein each item of predicted data $Y_{r,\tau}$ in time period $\tau$ is:
$Y_{r,\tau} = f(P(t))$
where f denotes the model to be solved, P(t) is the vector formed by all data, $t \in (1, \tau-1)$, and $r \in \text{resources}$.
5. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein the input of the long short-term memory network comprises the number of training samples, the time step, and the feature values, and the feature values are represented by the vector P(t) formed by all data.
6. The large-scale cluster system-oriented node fault prediction method of claim 5, wherein the correlation between each feature value and faults is obtained by calculating Pearson correlation coefficients, and the 9 feature values whose correlation coefficients are larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values.
7. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein the input of the random forest is the vector $P(t)$ of feature values within the first observation window and the vector $Y(t_1)$ of feature values within the lead time window, the prediction yields whether a fault occurs within the prediction window, and y, which indicates whether a fault occurs within the prediction window, is:
$y = f(P(t), Y(t_1))$
where f represents the model to be solved, 1 represents a fault, and 0 represents no fault.
CN201911107846.4A 2019-11-13 2019-11-13 Node fault prediction method for large-scale cluster system Active CN111027591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911107846.4A CN111027591B (en) 2019-11-13 2019-11-13 Node fault prediction method for large-scale cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911107846.4A CN111027591B (en) 2019-11-13 2019-11-13 Node fault prediction method for large-scale cluster system

Publications (2)

Publication Number Publication Date
CN111027591A CN111027591A (en) 2020-04-17
CN111027591B true CN111027591B (en) 2022-07-12

Family

ID=70205580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911107846.4A Active CN111027591B (en) 2019-11-13 2019-11-13 Node fault prediction method for large-scale cluster system

Country Status (1)

Country Link
CN (1) CN111027591B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085111B (en) * 2020-09-14 2024-01-23 南方电网科学研究院有限责任公司 Load identification method and device
CN113076239B (en) * 2021-04-12 2023-05-23 西安交通大学 Hybrid neural network fault prediction method and system for high-performance computer
CN114462679A (en) * 2022-01-04 2022-05-10 广州杰赛科技股份有限公司 Network traffic prediction method, device, equipment and medium based on deep learning

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018034745A1 (en) * 2016-08-18 2018-02-22 The Regents Of The University Of California Nanopore sequencing base calling
CN106909989A (en) * 2017-02-24 2017-06-30 国网河南省电力公司电力科学研究院 A kind of grid disturbance Forecasting Methodology and device
CN107679649A (en) * 2017-09-13 2018-02-09 珠海格力电器股份有限公司 A kind of failure prediction method of electrical equipment, device, storage medium and electrical equipment
CN107769972A (en) * 2017-10-25 2018-03-06 武汉大学 A kind of power telecom network equipment fault Forecasting Methodology based on improved LSTM
CN108170529A (en) * 2017-12-26 2018-06-15 北京工业大学 Cloud data center load prediction method based on long short-term memory network
CN108090558A (en) * 2018-01-03 2018-05-29 华南理工大学 Automatic completion method for time series missing values based on long short-term memory network
CN108415789A (en) * 2018-01-24 2018-08-17 西安交通大学 Node failure forecasting system and method towards extensive mixing heterogeneous storage system
CN110198223A (en) * 2018-02-27 2019-09-03 中兴通讯股份有限公司 Network failure prediction technique, device and equipment, storage medium
CN108900546A (en) * 2018-08-13 2018-11-27 杭州安恒信息技术股份有限公司 The method and apparatus of time series Network anomaly detection based on LSTM
CN109033450A (en) * 2018-08-22 2018-12-18 太原理工大学 Lift facility failure prediction method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Predicting Node failure in cloud service systems;Qingwei Lin et al.;《ESEC/FSE 2018》;20181026;pp. 480-490 *
Research on Key Technologies of Time Series Prediction for Industrial Big Data;宋杨;《China Master's Theses Full-text Database》;20190115;Vol. 2019 (No. 1);A002-1156 *

Also Published As

Publication number Publication date
CN111027591A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111027591B (en) Node fault prediction method for large-scale cluster system
US20220255817A1 (en) Machine learning-based vnf anomaly detection system and method for virtual network management
Zhang et al. Resource requests prediction in the cloud computing environment with a deep belief network
CN114297036B (en) Data processing method, device, electronic equipment and readable storage medium
CN113312447A (en) Semi-supervised log anomaly detection method based on probability label estimation
CN112433896B (en) Method, device, equipment and storage medium for predicting server disk faults
Zhang et al. Energy theft detection in an edge data center using threshold-based abnormality detector
CN117078048B (en) Digital twinning-based intelligent city resource management method and system
CN115112372A (en) Bearing fault diagnosis method and device, electronic equipment and storage medium
Gupta et al. A supervised deep learning framework for proactive anomaly detection in cloud workloads
CN117234301A (en) Server thermal management method based on artificial intelligence
CN110543462A (en) Microservice reliability prediction method, prediction device, electronic device, and storage medium
WO2020220437A1 (en) Method for virtual machine software aging prediction based on adaboost-elman
CN113886454A (en) Cloud resource prediction method based on LSTM-RBF
Tuli et al. Deepft: Fault-tolerant edge computing using a self-supervised deep surrogate model
Sun et al. Aledar: An attentions-based encoder-decoder and autoregressive model for workload forecasting of cloud data center
CN115423041A (en) Edge cloud fault prediction method and system based on deep learning
WO2022251004A1 (en) Hierarchical neural network-based root cause analysis for distributed computing systems
CN115408182A (en) Service system fault positioning method and device
Georgoulopoulos et al. A survey on hardware failure prediction of servers using machine learning and deep learning
CN113535522A (en) Abnormal condition detection method, device and equipment
Luo et al. Intelligent Identification over Power Big Data: Opportunities, Solutions, and Challenges.
Chen et al. Decision tree-based prediction approach for improving stable energy management in smart grids
CN117560275B (en) Root cause positioning method and device for micro-service system based on graphic neural network model
Singh et al. A feature extraction and time warping based neural expansion architecture for cloud resource usage forecasting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant