CN111290913A - Fault location visualization system and method based on operation and maintenance data prediction - Google Patents

Publication number
CN111290913A
Authority
CN
China
Prior art keywords
information
module
prediction
fault
maintenance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010079674.0A
Other languages
Chinese (zh)
Inventor
王子健
周扬帆
付娇娇
陈昊
蔡煜
曹袖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
CERNET Corp
Original Assignee
Fudan University
CERNET Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University, CERNET Corp
Priority to CN202010079674.0A
Publication of CN111290913A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/323 Visualisation of programs or trace data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3476 Data logging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention discloses a fault location visualization system and method based on operation and maintenance data prediction. The invention collects the machine and application log information in the network cluster by using the existing log collection framework to generate operation and maintenance big data, and then processes the operation and maintenance big data by using an artificial intelligence method, thereby predicting the network fault in advance and displaying the network fault in a visual mode. The invention has the beneficial effects that: the fault can be identified efficiently and accurately, the network fault can be predicted in advance, the operation and maintenance personnel can be given sufficient time to carry out operation and maintenance work in time, and the work efficiency of the operation and maintenance personnel can be improved effectively.

Description

Fault location visualization system and method based on operation and maintenance data prediction
Technical Field
The invention relates to a fault location visualization system and method based on operation and maintenance data prediction, and belongs to the technical field of computer networks and intelligent operation and maintenance.
Background
As network technology matures, wireless network coverage keeps expanding and people are surrounded by more and more smart devices, so the number of devices accessing the network has increased dramatically. Under these conditions, the quality of the information services provided by the network is a key factor affecting user experience. For example, if an access authentication system in the network fails, the working efficiency of the users who rely on that system may drop sharply. For these two reasons, current networks place higher requirements on the accuracy and timeliness of fault location, and network fault detection and location has become a key research problem.
At present, frameworks for locating and analyzing faults in the network include eSight, ELK, Splunk, Zabbix and the like, and they implement their functions in essentially the same way: an application deployed on each terminal collects log information from the device, key information about the device is extracted from the logs, faults on the device are located and analyzed by setting thresholds and applying simple statistical analysis, and visualization and notification mechanisms assist operation and maintenance personnel in their work.
Although many network fault analysis and location frameworks exist, these solutions have a number of shortcomings in the current environment. First, for anomaly detection most schemes rely mainly on threshold methods and on extracting error log information to detect and locate faults. Such methods have difficulty finding errors in time and are prone to missed alarms, and the resulting risk keeps growing as networks continue to grow. Moreover, the thresholds are generally specified by experts; as the system is updated frequently, re-deriving the parameters requires excessive work and the accuracy is low. Second, the sharp increase in the number of access devices makes the volume of log information too large. Even after processing by current frameworks, the logs still contain too much information to help operation and maintenance personnel understand the current network state in time. As a result, many faults cannot be handled promptly, and the personnel fall into a passive mode in which they wait for user feedback before carrying out operation and maintenance.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a fault location visualization system and method based on operation and maintenance data prediction. The method and the system efficiently and accurately identify the fault based on the artificial intelligence technology, and can predict the network fault in advance, thereby giving sufficient time to operation and maintenance personnel to carry out operation and maintenance work in time, and effectively assisting in improving the working efficiency of the operation and maintenance personnel.
The technical scheme of the invention is specifically introduced as follows.
A fault location visualization system based on operation and maintenance data prediction comprises a data collection part, an algorithm prediction part and a visualization display part; wherein:
a data collection part: collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;
an algorithm prediction part: learning prior experience from the collected historical operation and maintenance big data by using machine learning and neural network methods, so as to generate an artificial intelligence prediction model that matches the patterns of the actual production environment; then carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model;
a visualization display part: used for displaying all log information and fault prediction information.
In the invention, the data collection part comprises a log collection module and a performance index extraction module; wherein:
a log collection module: the system comprises a distributed log collection component and a centralized log storage component, wherein the distributed log collection component collects machine and application logs of all machines in a cluster, and completes the work of log redirection centralized processing;
a performance index extraction module: index information reflecting the performance of the machine in the log is extracted, and operation and maintenance big data are constructed by a time sequence extraction method.
In the invention, in the performance index extraction module, index information reflecting the performance of the machine is extracted by a filtering and keyword extraction method; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.
In the invention, the algorithm prediction part comprises an algorithm self-adaptive module and a fault prediction module; wherein:
an algorithm self-adaptive module: training models through a statistical feature screening algorithm and a cross-validation screening algorithm based on the historical operation and maintenance big data of the previous day, to obtain an artificial intelligence model for predicting faults on the current day;
a fault prediction module: carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model.
In the invention, a visualization part comprises a real-time monitoring module, a history prediction module, a log information retrieval module and a machine performance curve display module; wherein:
a real-time monitoring module: used for displaying the real-time performance state of the machines in the cluster; the performance state comprises CPU utilization rate, memory utilization rate, disk throughput rate, network bandwidth, fault prediction information, and the machine state obtained by collecting information through threshold and SNMP methods;
a history prediction module: used for caching historical fault prediction information of the machines in the cluster;
a log information retrieval module: used for providing the extracted machine performance log information;
a machine performance curve display module: used for displaying the log information to the operation and maintenance personnel in the form of graphs.
The invention also provides a fault location visualization method based on the system, which comprises the following specific steps:
(1) collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;
(2) learning prior experience based on collected historical operation and maintenance big data by using a machine learning and neural network method, so as to generate an artificial intelligence prediction model meeting the requirements of an actual production rule; carrying out fault prediction on the operation and maintenance data acquired in real time by using an artificial intelligence model;
(3) displaying all log information and fault prediction information.
In the invention, in the step (1), index information reflecting the performance of the machine is extracted by a filtering and keyword extracting method; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.
In the invention, in the step (2), the artificial intelligence model for predicting the faults on the same day is obtained by training the model through a statistical feature screening algorithm and a cross validation screening algorithm based on historical operation and maintenance big data on the previous day.
In the invention, in the step (3), the information is displayed in the form of charts and tables.
Compared with the prior art, the invention has the beneficial effects that:
the invention uses machine learning and neural network methods to learn prior experience from the collected historical operation and maintenance big data, and the resulting trained model can accurately predict and locate network faults;
the invention displays the conventional log monitoring information and the predicted fault information of each machine through a visual operation and maintenance information platform, so that operation and maintenance personnel can complete operation and maintenance tasks effectively and intuitively.
Drawings
Fig. 1 is a system architecture diagram.
Fig. 2 is a flow chart of the system operation.
Fig. 3 is a diagram showing a structure of a data collection portion.
Fig. 4 is a structure diagram of the algorithm prediction part.
Fig. 5 is a structure diagram of the visualization display part.
Detailed Description
The technical solution of the present invention will be described in detail with reference to the accompanying drawings and embodiments.
The invention first uses a distributed log collection framework to collect machine log information on each device, then extracts the key information in the logs as the performance indexes of the machine at that time, and constructs operation and maintenance big data. The processed operation and maintenance big data is then used to train an artificial intelligence model. When the latest real-time performance data is input, the model predicts whether the machines will fail in a later period of time, and the predicted faults are presented to operation and maintenance personnel in the form of charts and tables. This helps the personnel identify problems in advance and prepare solutions, and addresses the problem that complex existing networks cannot be operated and maintained in time.
First, overall system structure
The system architecture diagram is shown in fig. 1, and the system mainly comprises three parts, namely a data collection part (log collection server), an algorithm prediction part (fault prediction server) and a visualization display part (visualization front end).
The data collection part mainly collects the machine and application logs of each machine in the cluster, and the main principle is that a lightweight log collection tool is deployed on the monitored machine, the logs are collected on the server through the distributed tools, key performance indexes in the logs are uniformly extracted to monitor the cluster state, and operation and maintenance big data are constructed.
The algorithm prediction part mainly utilizes a machine learning and neural network method to learn prior experience based on collected historical operation and maintenance big data, so as to generate a prediction model meeting the requirements of actual production rules. In the actual operation prediction process, real-time performance data is directly provided to the model as input, and whether a fault occurs in a later period of time is predicted through a learned rule.
The visualization display part is used for displaying all log information and prediction information. Because detailed text reports are less intuitive than charts, the invention uses charts to provide operation and maintenance personnel with real-time predicted faults and machine states as a reference, helping them efficiently understand and handle faults that may exist in the network cluster.
The operation principle of the whole system is shown in fig. 2. First, lightweight log collection components are deployed in a distributed manner on each machine in the cluster, and all required system log information and application log information on the machines is transmitted over the network to a dedicated storage server. A log storage tool deployed on the storage server receives the log information of all machines in the cluster and stores and processes it centrally. Key performance index information of each machine is then extracted from the logs on the storage server through methods such as filtering and keyword extraction, forming the operation and maintenance big data of all machines. Next, on another server, a model is trained on the collected historical operation and maintenance big data using the constructed neural network algorithm, generating an artificial intelligence model suited to the actual environment. The real-time operation and maintenance data is processed and used as the input of the model, so that whether a machine in the cluster will fail in the next period of time is predicted in real time, completing the fault prediction and location work based on operation and maintenance big data. Finally, a visual operation and maintenance data display platform is built with a front-end framework; it monitors the cluster state in real time, raises early warnings for devices with predicted faults, and caches the predicted fault information for later processing. Operation and maintenance personnel can thus efficiently understand the cluster state through charts, use the detailed log information to locate faults accurately, and design fault solutions in advance.
Second, data collection part
As shown in fig. 3, the data collection part mainly includes a log collection module and a performance index extraction module, and this part mainly completes the tasks of collecting monitoring information and extracting key performance indexes in the cluster.
A log collection module: to accomplish the tasks of log redirection and storage. The invention refers to the existing mature cluster log monitoring and management framework applied to the bottom layer operation and maintenance work of a large number of companies to complete the task of log collection. The log collection tool and the log storage tool in the mainstream framework ELK are mainly used for completing the work of log redirection centralized processing, so that log information in a cluster is centrally stored and processed, and unified management is facilitated.
A performance index extraction module: used to extract the indexes in the log that reflect machine performance. In this scheme, the CPU utilization rate, memory utilization rate, disk reading speed and network traffic bandwidth of each machine are collected to measure the state of a machine. The required key performance indexes are extracted from the system log information of the machine using techniques such as keyword extraction and filtering, so that the state of each machine in the cluster is reflected, and operation and maintenance big data is constructed using a time sequence extraction method.
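The filtering and keyword extraction described above can be sketched as follows. This is a minimal illustration only: the log line layout, the key names (`cpu`, `mem`, `disk_read`, `net`) and the function `extract_metrics` are assumptions made for the example, since the source does not specify a log format.

```python
import re
from typing import Optional

# Hypothetical log line layout, assumed for illustration only:
#   "2020-02-04T10:00:01 host-01 cpu=37.5 mem=62.0 disk_read=120.4 net=88.1"
METRIC_KEYS = ("cpu", "mem", "disk_read", "net")
PATTERN = re.compile(r"(\w+)=([0-9]+(?:\.[0-9]+)?)")


def extract_metrics(line: str) -> Optional[dict]:
    """Return timestamp, host and the four metrics, or None if the line is filtered out."""
    parts = line.split()
    if len(parts) < 3:
        return None
    # Keyword extraction: keep only key=value pairs whose key is a known metric.
    pairs = {k: float(v) for k, v in PATTERN.findall(line) if k in METRIC_KEYS}
    # Filtering: skip lines that are not complete performance records.
    if set(pairs) != set(METRIC_KEYS):
        return None
    return {"time": parts[0], "host": parts[1], **pairs}
```

Applying `extract_metrics` to every stored log line and dropping the `None` results would yield one record per machine per timestamp, which is the raw material for the time sequences built later.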
Third, algorithm prediction part
As shown in FIG. 4, the algorithm prediction part adaptively selects an appropriate artificial intelligence algorithm according to different kinds and statistical characteristics of the collected operation and maintenance big data, thereby completing the prediction work of different types of data. The part consists of an algorithm self-adapting module and a fault predicting module.
An algorithm self-adaptive module: screens suitable algorithms to learn the prior experience in the operation and maintenance big data. As the cluster system is continually iterated and redesigned, the statistical characteristics and distribution of the operation and maintenance data may change greatly. If the algorithm model of a previous version is still used for prediction, the prediction accuracy drops sharply and a large number of false positives result. To avoid this, the invention trains the model using a statistical feature screening algorithm and a cross-validation screening algorithm. Specifically, the model is retrained every day: the historical operation and maintenance big data of the previous day is collected as training data, and some statistical indexes of the training data are calculated to select suitable candidate algorithms. The training data is then processed into time sequences and divided into a training set and a test set; the candidate algorithms are configured with suitable parameters and learn on the training set; cross validation is then performed using the test set, and the model with the best result is stored persistently as the artificial intelligence model for predicting faults on the current day.
A fault prediction module: performs fault prediction on the operation and maintenance data acquired in real time using the persisted model. The previous module generates the model used for prediction; real-time machine performance data is then processed into operation and maintenance data and provided to the model as input, and the model outputs whether a machine is likely to fail within a certain period of time and what kind of fault is likely to occur, thereby completing the tasks of network fault location and prediction.
Fourth, visual display part
As shown in fig. 5, the visualization display portion mainly includes a real-time monitoring module, a history prediction module, a log information retrieval module, and a machine performance curve display module.
A real-time monitoring module: the module is used for displaying the real-time performance state of the machines in the cluster. It mainly displays the real-time CPU utilization rate, memory utilization rate, disk throughput rate and network bandwidth of each machine, and whether a fault is predicted for each machine in the next time period. It also judges whether the machines are normal from information collected through methods such as thresholds and SNMP.
A history prediction module: the module is mainly used for caching the historical prediction information of each machine. Professional operation and maintenance personnel who were consulted suggested that the history of each prediction be saved. Once a fault occurs, the prediction information contains important information for the operation and maintenance personnel and helps them understand the fault in time.
A log information retrieval module: this module mainly provides the extracted machine performance log information. This information is the main basis on which operation and maintenance personnel currently work, and it is provided so that they can carry out operation and maintenance in a familiar way. When the personnel receive predicted fault information, they can quickly check the current log information of the machine, confirm the exact cause of the fault more accurately, and prepare a corresponding solution using the log information and their operation and maintenance knowledge.
The machine performance curve display module: the module is used for displaying the log information to the operation and maintenance personnel in a graph mode. Therefore, operation and maintenance personnel can intuitively find whether the performance curve is abnormal or not so as to quickly determine the fault, and meanwhile, the real-time prediction result is displayed on the performance curve, so that important auxiliary information is provided for the operation and maintenance personnel.
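As a rough sketch of the data behind the real-time monitoring and history prediction modules, the following combines a threshold check with the model's prediction and keeps a bounded per-machine history. The threshold values, the status names and the class `PredictionHistory` are assumptions for illustration; the source does not give concrete values, and SNMP collection is omitted.

```python
from collections import defaultdict, deque

# Hypothetical limits; the source does not state threshold values.
THRESHOLDS = {"cpu": 90.0, "mem": 90.0}


def machine_status(metrics, predicted_fault):
    """Combine the threshold check with the model's prediction into one status."""
    exceeded = sorted(k for k, limit in THRESHOLDS.items() if metrics.get(k, 0.0) > limit)
    if exceeded:
        return "abnormal", exceeded
    return ("warning" if predicted_fault else "normal"), []


class PredictionHistory:
    """Bounded per-machine cache of past predictions."""

    def __init__(self, max_per_machine=1000):
        self._records = defaultdict(lambda: deque(maxlen=max_per_machine))

    def add(self, host, timestamp, predicted_fault):
        self._records[host].append((timestamp, predicted_fault))

    def recent(self, host, n=10):
        # .get avoids creating an empty entry for unknown hosts
        return list(self._records.get(host, ()))[-n:]
```

A front end could render `machine_status` as a per-machine indicator and use `PredictionHistory.recent` to list earlier predictions when a fault is investigated.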
Based on the system, the invention provides a fault location visualization method based on operation and maintenance big data prediction, which comprises the following specific implementation steps:
First, the user installs a lightweight log collection tool on each machine in the network cluster and keeps the component running continuously. Through its configuration file, the log collection tool is set to collect and redirect the system log information and the designated application performance log information on the machines, so that the relevant information of all machines in the cluster can be collected to monitor their performance. Compared with other tools, this log collection tool is simpler and more convenient and its own overhead is almost negligible, so deploying it widely in a cluster does not add excessive overhead and the performance of the original network is preserved. All log information is redirected to a server in the cluster on which a log storage tool is deployed, where it is stored for subsequent analysis.
Secondly, the relevant logs of all the machines in the cluster are redirected to the log storage server by the distributed log collection tool; the log storage server can also use a message queue tool to buffer messages, so that all the log information can be stored on the server in a centralized manner. Then, to monitor the machine performance in the cluster, indexes that reflect machine performance, such as CPU utilization rate, memory utilization rate, disk throughput rate and network bandwidth traffic, are extracted from the system log using a filtering method or a keyword extraction method. With these indexes, the state of the machines in the cluster can be monitored in real time, and a foundation is provided for the subsequent prediction work.
Thirdly, after the performance index data of all machines is collected, it is processed into time sequence data. According to the initial settings, the time sequences are constructed with a window size of 100 seconds and an interval of 1 second between two consecutive time sequences. The label of each time sequence indicates whether a fault occurs within 1000 seconds after the time window: the label is 1 if a fault occurs and 0 otherwise. For example, let s_n denote the data of the machine at the n-th second. Suppose we have the data s_1, s_2, s_3, ..., s_101 and a fault occurs within 1000 seconds. We then construct two time windows, one containing s_1, s_2, s_3, ..., s_100 and the other containing s_2, s_3, s_4, ..., s_101, and both time sequences have the label 1. After all the data has been constructed into time sequence data and stored, statistical information of each time sequence window is computed as new features, and the operation and maintenance big data is generated.
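The window construction in this step can be sketched as below. The function names, the zero-based second indexing, the exact boundary of the 1000-second label horizon, and the choice of statistics (mean, standard deviation, minimum, maximum) are assumptions for illustration; the source only states the window size, the interval and the labelling rule.

```python
from statistics import mean, pstdev


def build_windows(values, fault_times, window=100, stride=1, horizon=1000):
    """Build labelled sliding windows.

    values[i] is the metric at second i (zero-based); fault_times lists the
    seconds at which a fault occurred. A window covering seconds [start, end)
    is labelled 1 if any fault falls in [end, end + horizon), else 0.
    """
    samples = []
    for start in range(0, len(values) - window + 1, stride):
        end = start + window
        label = int(any(end <= t < end + horizon for t in fault_times))
        samples.append((values[start:end], label))
    return samples


def window_features(win):
    """Per-window summary statistics used as additional features (assumed set)."""
    return [mean(win), pstdev(win), min(win), max(win)]
```

With 101 seconds of data and one fault shortly afterwards, `build_windows` yields exactly the two windows of the example in the text, both labelled 1.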
Fourthly, an algorithm model is trained to complete the task of fault prediction. To guarantee timeliness and accuracy, all operation and maintenance data of the previous day is divided into a training set and a test set and used to train a suitable model for predicting network faults on the next day; the default ratio of the training set to the test set is 4:1. Then, to complete the adaptive screening of algorithms, each algorithm model with preset parameters is trained and cross-validated using the resulting training set and test set, and the model with the best cross-validation result is taken as the prediction model used in actual production and stored persistently, so that it can predict and locate faults in real time in the next step.
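A minimal sketch of the 4:1 split and best-model selection follows. The candidate models here are deliberately simple one-feature threshold rules standing in for the machine learning and neural network models of the source, the split is a single chronological hold-out rather than k-fold cross validation, and all function names are hypothetical. Persistent storage of the chosen model is omitted.

```python
from statistics import mean


def chronological_split(samples, train_parts=4, test_parts=1):
    """Split samples in time order at the default 4:1 ratio."""
    cut = len(samples) * train_parts // (train_parts + test_parts)
    return samples[:cut], samples[cut:]


def fit_threshold_model(train, feature):
    """Stand-in learner: threshold at the midpoint between the class means of one feature."""
    pos = [feature(w) for w, y in train if y == 1]
    neg = [feature(w) for w, y in train if y == 0]
    if not pos or not neg:
        # Only one class seen: always predict that class.
        threshold = float("inf") if not pos else float("-inf")
    else:
        threshold = (mean(pos) + mean(neg)) / 2
    return lambda w: int(feature(w) > threshold)


def accuracy(model, samples):
    return sum(model(w) == y for w, y in samples) / len(samples)


def select_model(samples, features):
    """Fit one candidate per feature and keep the one scoring best on the test set."""
    train, test = chronological_split(samples)
    fitted = {name: fit_threshold_model(train, f) for name, f in features.items()}
    best = max(fitted, key=lambda name: accuracy(fitted[name], test))
    return best, fitted[best]
```

In practice accuracy may be a poor criterion when faults are rare, so a metric such as recall or F1 on the fault class could be substituted in `select_model`.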
Fifthly, in an actual production environment, to achieve real-time fault prediction, the real-time data of a given machine and all its data from the preceding 99 seconds are processed into a time sequence. The persistently stored prediction model is then called with the processed data as input, and its output indicates whether a fault will occur on that machine within 1000 seconds after the prediction. This completes the work of real-time fault prediction and location and helps operation and maintenance personnel confirm potential faults in advance.
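The real-time step can be sketched with a fixed-length buffer that always holds the newest reading plus the preceding 99 seconds. The class name `RealTimePredictor` and the callable `model` interface (a function from a list of values to 0 or 1) are assumptions for illustration.

```python
from collections import deque


class RealTimePredictor:
    """Keep the latest `window` one-second readings of one machine and query the model."""

    def __init__(self, model, window=100):
        self.model = model
        self.buffer = deque(maxlen=window)  # oldest reading is dropped automatically

    def update(self, value):
        """Add the newest reading; return a prediction once the window is full, else None."""
        self.buffer.append(value)
        if len(self.buffer) < self.buffer.maxlen:
            return None
        return self.model(list(self.buffer))
```

One `RealTimePredictor` per machine, fed once per second, would produce a fresh prediction every second after the first 100 readings.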
Sixthly, once fault prediction is complete and a prediction result has been generated, the prediction result is combined with the actual performance information for visual display, helping operation and maintenance personnel understand the condition of the machines in the cluster simply and intuitively. First, the real-time performance state indexes (CPU utilization rate, memory utilization rate and the like) and the real-time fault prediction results of each machine in the cluster are displayed; these indexes mainly serve to monitor the machines in real time. If an operation and maintenance worker finds that a machine has a predicted fault, the worker can click the machine name to view detailed machine information, which comprises all performance log information and the performance curve graphs of the machine. The performance curve graphs let the worker inspect the performance curves visually to spot problems, and the predicted fault information is marked on them for reference. After identifying the point in time at which a particular performance fault may occur, the worker can review the corresponding performance log information to understand the fault in advance and prepare a solution. Finally, so that operation and maintenance personnel can understand faults more comprehensively, the historical records of predicted faults are retained for their analysis.

Claims (9)

1. A fault location visualization system based on operation and maintenance data prediction, characterized by comprising a data collection part, an algorithm prediction part and a visualization display part; wherein:
a data collection part: collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;
an algorithm prediction part: learning prior experience from the collected historical operation and maintenance big data by using machine learning and neural network methods, so as to generate an artificial intelligence prediction model meeting the requirements of actual production rules; then carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model;
a visualization display part: used for displaying all log information and fault prediction information.
2. The fault location visualization system according to claim 1, wherein the data collection part comprises a log collection module and a performance index extraction module; wherein:
a log collection module: comprising a distributed log collection component and a centralized log storage component, used for collecting machine and application logs of all machines in the cluster and completing the redirection and centralized processing of the logs;
a performance index extraction module: extracting index information reflecting machine performance from the logs, and constructing operation and maintenance big data by a time sequence extraction method.
3. The fault location visualization system according to claim 1, wherein in the performance index extraction module, index information reflecting machine performance is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk read bandwidth information and network traffic information.
4. The fault location visualization system according to claim 1, wherein the algorithm prediction part comprises an algorithm self-adaptive module and a fault prediction module; wherein:
an algorithm self-adaptive module: training a model through a statistical feature screening algorithm and a cross-validation screening algorithm based on the historical operation and maintenance big data of the previous day, to obtain an artificial intelligence model for predicting faults on the current day;
a fault prediction module: carrying out fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model.
5. The fault location visualization system according to claim 1, wherein the visualization display part comprises a real-time monitoring module, a history prediction module, a log information retrieval module and a machine performance curve display module; wherein:
a real-time monitoring module: used for displaying the real-time performance state of the machines in the cluster; the performance state comprises CPU utilization rate, memory utilization rate, disk throughput rate, network bandwidth, fault prediction information, and machine state obtained by collecting information through thresholds and the SNMP method;
a history prediction module: used for caching historical fault prediction information of the machines in the cluster;
a log information retrieval module: used for providing the extracted machine performance log information;
a machine performance curve display module: used for displaying the log information to operation and maintenance personnel in the form of graphs.
6. A fault localization visualization method based on the system of claim 1, characterized by comprising the following specific steps:
(1) collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;
(2) learning prior experience based on collected historical operation and maintenance big data by using a machine learning and neural network method, so as to generate an artificial intelligence prediction model meeting the requirements of an actual production rule; carrying out fault prediction on the operation and maintenance data acquired in real time by using an artificial intelligence model;
(3) displaying all log information and fault prediction information.
7. The fault location visualization method according to claim 6, wherein in the step (1), index information reflecting the performance of the machine is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.
8. The fault location visualization method according to claim 6, wherein in the step (2), the model is trained through a statistical feature screening algorithm and a cross-validation screening algorithm based on the historical operation and maintenance big data of the previous day, so as to obtain an artificial intelligence model for predicting faults on the current day.
9. The fault location visualization method according to claim 6, wherein in the step (3), the visualization is performed by means of graphs and tables.
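The filtering and keyword extraction of claims 3 and 7, followed by construction of a per-machine time series, might look roughly like the sketch below. The log line format (`<epoch-seconds> <machine> key=value ...`) and the keyword names `cpu_util`, `mem_util`, `disk_read_bw` and `net_traffic` are assumptions made for this illustration; the claims name the four kinds of index but do not define a log format.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<epoch-seconds> <machine> <key>=<value> ..."
KEYWORDS = ("cpu_util", "mem_util", "disk_read_bw", "net_traffic")
PAIR = re.compile(r"(\w+)=([0-9]+(?:\.[0-9]+)?)")


def extract_metrics(line):
    """Return (timestamp, machine, metrics), or None if the line is filtered out."""
    parts = line.split(maxsplit=2)
    if len(parts) < 3 or not parts[0].isdigit():
        return None  # malformed line
    metrics = {k: float(v) for k, v in PAIR.findall(parts[2]) if k in KEYWORDS}
    if not metrics:
        return None  # no performance keywords on this line
    return int(parts[0]), parts[1], metrics


def build_time_series(lines):
    """Group the extracted metrics per machine, ordered by timestamp."""
    series = defaultdict(list)
    for line in lines:
        record = extract_metrics(line)
        if record is not None:
            timestamp, machine, metrics = record
            series[machine].append((timestamp, metrics))
    for points in series.values():
        points.sort(key=lambda point: point[0])
    return dict(series)


logs = [
    "1002 node-a cpu_util=0.71 mem_util=0.40",
    "1001 node-a cpu_util=0.65 pid=42",
    "1001 node-a INFO service started",
    "1001 node-b disk_read_bw=120.5 net_traffic=33",
]
for machine, points in build_time_series(logs).items():
    print(machine, points)
```

Lines without any of the performance keywords are dropped (the filtering step), and unrelated key-value pairs such as `pid=42` are ignored (the keyword extraction step).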

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010079674.0A CN111290913A (en) 2020-02-04 2020-02-04 Fault location visualization system and method based on operation and maintenance data prediction

Publications (1)

Publication Number Publication Date
CN111290913A (en) 2020-06-16

Family

ID=71023448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010079674.0A Pending CN111290913A (en) 2020-02-04 2020-02-04 Fault location visualization system and method based on operation and maintenance data prediction

Country Status (1)

Country Link
CN (1) CN111290913A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984499A (en) * 2020-08-04 2020-11-24 中国建设银行股份有限公司 Fault detection method and device for big data cluster
CN112149845A (en) * 2020-09-23 2020-12-29 山东通维信息工程有限公司 Intelligent operation and maintenance method based on big data and machine learning
CN112764852A (en) * 2021-01-18 2021-05-07 深圳供电局有限公司 Operation and maintenance safety monitoring method and system for intelligent wave recording master station and computer readable storage medium
CN113777476A (en) * 2021-08-30 2021-12-10 苏州浪潮智能科技有限公司 GPU fault diagnosis system, diagnosis method, equipment and readable storage medium
CN113791926A (en) * 2021-09-18 2021-12-14 平安普惠企业管理有限公司 Intelligent alarm analysis method, device, equipment and storage medium
CN113807716A (en) * 2021-09-23 2021-12-17 宝信软件(武汉)有限公司 Network operation and maintenance automation method based on artificial intelligence
WO2022047658A1 (en) * 2020-09-02 2022-03-10 大连大学 Log anomaly detection system
CN114338344A (en) * 2021-12-27 2022-04-12 北京卓越信通电子股份有限公司 Method for judging and restraining computer network fault and broadcast storm by machine deep learning mode

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105323111A (en) * 2015-11-17 2016-02-10 南京南瑞集团公司 Operation and maintenance automation system and method
CN105653444A (en) * 2015-12-23 2016-06-08 北京大学 Internet log data-based software defect failure recognition method and system
CN107332685A (en) * 2017-05-22 2017-11-07 国网安徽省电力公司信息通信分公司 A kind of method based on big data O&M daily record applied in state's net cloud
CN108446200A (en) * 2018-02-07 2018-08-24 福建星瑞格软件有限公司 Server intelligence O&M method based on big data machine learning and computer equipment
CN109492826A (en) * 2018-12-06 2019-03-19 远光软件股份有限公司 A kind of information system operating status Risk Forecast Method based on machine learning

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984499A (en) * 2020-08-04 2020-11-24 中国建设银行股份有限公司 Fault detection method and device for big data cluster
WO2022047658A1 (en) * 2020-09-02 2022-03-10 大连大学 Log anomaly detection system
CN112149845A (en) * 2020-09-23 2020-12-29 山东通维信息工程有限公司 Intelligent operation and maintenance method based on big data and machine learning
CN112764852A (en) * 2021-01-18 2021-05-07 深圳供电局有限公司 Operation and maintenance safety monitoring method and system for intelligent wave recording master station and computer readable storage medium
CN113777476A (en) * 2021-08-30 2021-12-10 苏州浪潮智能科技有限公司 GPU fault diagnosis system, diagnosis method, equipment and readable storage medium
CN113777476B (en) * 2021-08-30 2024-02-23 苏州浪潮智能科技有限公司 GPU fault diagnosis system, diagnosis method, equipment and readable storage medium
CN113791926A (en) * 2021-09-18 2021-12-14 平安普惠企业管理有限公司 Intelligent alarm analysis method, device, equipment and storage medium
CN113807716A (en) * 2021-09-23 2021-12-17 宝信软件(武汉)有限公司 Network operation and maintenance automation method based on artificial intelligence
CN114338344A (en) * 2021-12-27 2022-04-12 北京卓越信通电子股份有限公司 Method for judging and restraining computer network fault and broadcast storm by machine deep learning mode

Similar Documents

Publication Publication Date Title
CN111290913A (en) Fault location visualization system and method based on operation and maintenance data prediction
CN108415789B (en) Node fault prediction system and method for large-scale hybrid heterogeneous storage system
CN108073497B (en) Multi-index transaction analysis method based on data center data acquisition platform
CN107196804B (en) Alarm centralized monitoring system and method for terminal communication access network of power system
CN111209131A (en) Method and system for determining fault of heterogeneous system based on machine learning
CN106301522A (en) The Visual method of fault diagnosis of Remote Sensing Ground Station data receiver task and system
CN113516244B (en) Intelligent operation and maintenance method and device, electronic equipment and storage medium
CN113360358B (en) Method and system for adaptively calculating IT intelligent operation and maintenance health index
CN106598800A (en) Hardware fault analysis system and method
CN110162445A (en) The host health assessment method and device of Intrusion Detection based on host log and performance indicator
CN115225536B (en) Virtual machine abnormality detection method and system based on unsupervised learning
CN106940678B (en) System real-time health degree evaluation and analysis method and device
CN111369094A (en) Alarm order dispatching method, device and system and computer readable storage medium
CN103049365B (en) Information and application resource running state monitoring and evaluation method
CN112783682A (en) Abnormal automatic repairing method based on cloud mobile phone service
CN112817814A (en) Abnormity monitoring method, system, storage medium and electronic device
CN115640860B (en) Electromechanical equipment remote maintenance method and system for industrial cloud service
CN108664696B (en) Method and device for evaluating running state of water chiller
CN116714469A (en) Charging pile health monitoring method, device, terminal and storage medium
CN115438093A (en) Power communication equipment fault judgment method and detection system
CN115509784A (en) Fault detection method and device for database instance
CN110413482B (en) Detection method and device
CN116522213A (en) Service state level classification and classification model training method and electronic equipment
Wang et al. LSTM-based alarm prediction in the mobile communication network
CN113807690A (en) Online evaluation and early warning method and system for operation state of regional power grid regulation and control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200616