CN111290913A

CN111290913A - Fault location visualization system and method based on operation and maintenance data prediction

Info

Publication number: CN111290913A
Application number: CN202010079674.0A
Authority: CN
Inventors: 王子健; 周扬帆; 付娇娇; 陈昊; 蔡煜; 曹袖
Original assignee: Fudan University; CERNET Corp
Current assignee: Fudan University; CERNET Corp
Priority date: 2020-02-04
Filing date: 2020-02-04
Publication date: 2020-06-16

Abstract

The invention discloses a fault location visualization system and method based on operation and maintenance data prediction. The invention collects the machine and application log information in the network cluster by using the existing log collection framework to generate operation and maintenance big data, and then processes the operation and maintenance big data by using an artificial intelligence method, thereby predicting the network fault in advance and displaying the network fault in a visual mode. The invention has the beneficial effects that: the fault can be identified efficiently and accurately, the network fault can be predicted in advance, the operation and maintenance personnel can be given sufficient time to carry out operation and maintenance work in time, and the work efficiency of the operation and maintenance personnel can be improved effectively.

Description

Fault location visualization system and method based on operation and maintenance data prediction

Technical Field

The invention relates to a fault location visualization system and method based on operation and maintenance data prediction, and belongs to the technical field of computer networks and intelligent operation and maintenance.

Background

As network technology matures, wireless network coverage keeps expanding and more and more smart devices surround people, so the number of devices connected to networks has increased dramatically. At the same time, the quality of the information services a network provides is a key factor in user experience; for example, if a network's access authentication system fails, the working efficiency of the users who depend on it may drop sharply. For these two reasons, current networks place higher demands on the accuracy and timeliness of fault location, and network fault detection and location has become a key research problem.

At present, frameworks for locating and analyzing network faults include eSight, ELK, Splunk, Zabbix and the like, and they all work in broadly the same way. An agent is deployed on each terminal to collect log information from the device; key information about the device is then extracted from the logs; faults on the device are located and analyzed by means of preset thresholds and simple statistical analysis; and visualization and notification mechanisms are provided to assist operation and maintenance personnel in their work.

Although many network fault analysis and location frameworks exist, these solutions have a number of drawbacks in the current environment. First, for anomaly detection most of them rely mainly on thresholds and on extracting error log entries to detect and locate faults. This approach has difficulty finding errors in time and is prone to missed detections, and the resulting risk keeps growing as networks become larger and more capable. Moreover, the thresholds are generally specified by experts; as the system is updated frequently, re-deriving suitable parameter settings requires excessive effort and the resulting accuracy is low. Second, the sharp increase in the number of connected devices makes the volume of log data very large. Even after processing by current frameworks, the logs still carry too much information to help operation and maintenance personnel understand the current network state in time. As a result, many faults cannot be handled promptly, and the personnel fall into passive operation and maintenance, acting only after users report problems.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a fault location visualization system and method based on operation and maintenance data prediction. The method and the system efficiently and accurately identify the fault based on the artificial intelligence technology, and can predict the network fault in advance, thereby giving sufficient time to operation and maintenance personnel to carry out operation and maintenance work in time, and effectively assisting in improving the working efficiency of the operation and maintenance personnel.

The technical solution of the invention is described in detail below.

A fault location visualization system based on operation and maintenance data prediction comprises a data collection part, an algorithm prediction part and a visualization display part; wherein:

a data collection part: collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;

an algorithm prediction part: learning prior experience from the collected historical operation and maintenance big data by using machine learning and neural network methods, so as to generate an artificial intelligence prediction model that meets the requirements of actual production rules; then performing fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model;

a visualization display part: displaying all log information and fault prediction information.

In the invention, the data collection part comprises a log collection module and a performance index extraction module; wherein:

a log collection module: comprising a distributed log collection component and a centralized log storage component, which together collect the machine and application logs of all machines in the cluster and complete log redirection and centralized processing;

a performance index extraction module: index information reflecting the performance of the machine in the log is extracted, and operation and maintenance big data are constructed by a time sequence extraction method.

In the invention, in the performance index extraction module, index information reflecting the performance of the machine is extracted by a filtering and keyword extraction method; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.

In the invention, the algorithm prediction part comprises an algorithm self-adaptive module and a fault prediction module; wherein:

an algorithm self-adaptive module: training a model through a statistical feature screening algorithm and a cross validation screening algorithm based on historical operation and maintenance big data of the previous day to obtain an artificial intelligence model for predicting faults on the same day;

a fault prediction module: performing fault prediction on the operation and maintenance data acquired in real time by using the artificial intelligence model.

In the invention, a visualization part comprises a real-time monitoring module, a history prediction module, a log information retrieval module and a machine performance curve display module; wherein:

a real-time monitoring module: used for displaying the real-time performance state of the machines in the cluster; the performance state comprises CPU utilization rate, memory utilization rate, disk throughput rate, network bandwidth, fault prediction information, and the machine state determined from information collected by threshold and SNMP methods;

a history prediction module: used for caching the historical fault prediction information of the machines in the cluster;

a log information retrieval module: for providing extracted machine performance log information;

a machine performance curve display module: used for displaying the log information to the operation and maintenance personnel in the form of graphs.

The invention also provides a fault location visualization method based on the system, which comprises the following specific steps:

(1) collecting machine and application logs of each machine in the cluster, uniformly extracting key performance indexes used for monitoring cluster states, and constructing operation and maintenance big data by using a time sequence extraction method;

(2) learning prior experience based on collected historical operation and maintenance big data by using a machine learning and neural network method, so as to generate an artificial intelligence prediction model meeting the requirements of an actual production rule; carrying out fault prediction on the operation and maintenance data acquired in real time by using an artificial intelligence model;

(3) displaying all log information and fault prediction information.

In the invention, in the step (1), index information reflecting the performance of the machine is extracted by a filtering and keyword extracting method; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.

In the invention, in the step (2), the artificial intelligence model for predicting the faults on the same day is obtained by training the model through a statistical feature screening algorithm and a cross validation screening algorithm based on historical operation and maintenance big data on the previous day.

In the invention, in the step (3), the information is displayed in the form of graphs and tables.

Compared with the prior art, the invention has the beneficial effects that:

the invention learns prior experience from collected historical operation and maintenance big data by machine learning and neural network methods, and the resulting trained model enables accurate prediction and location of network faults;

the invention displays the conventional log monitoring information and the predicted fault information of each machine through a visual operation and maintenance information platform, so that operation and maintenance personnel can complete operation and maintenance tasks effectively and intuitively.

Drawings

Fig. 1 is a system architecture diagram.

Fig. 2 is a flow chart of the system operation.

Fig. 3 is a diagram showing a structure of a data collection portion.

Fig. 4 is a structural diagram of the algorithm prediction part.

Fig. 5 is a structural diagram of the visualization display part.

Detailed Description

The technical solution of the present invention will be described in detail with reference to the accompanying drawings and embodiments.

The invention first uses a distributed log collection framework to collect the machine log information on each device, then extracts the key information from it as the performance indices of the machine at that moment, and constructs operation and maintenance big data. The processed operation and maintenance big data is then used to train an artificial intelligence model. When the latest real-time performance data is input, the model predicts whether the machines will fail within a subsequent period of time, and the predictions are presented to operation and maintenance personnel in the form of charts and tables. This helps the personnel identify problems in advance and prepare solutions, and addresses the problem that, in today's complex networks, operation and maintenance personnel cannot carry out operation and maintenance in time.

1. Overall system structure

The system architecture diagram is shown in fig. 1, and the system mainly comprises three parts, namely a data collection part (log collection server), an algorithm prediction part (fault prediction server) and a visualization display part (visualization front end).

The data collection part mainly collects the machine and application logs of each machine in the cluster. A lightweight log collection tool is deployed on every monitored machine, these distributed tools gather the logs onto a server, and the key performance indices in the logs are extracted in a unified manner to monitor the cluster state and construct the operation and maintenance big data.

The algorithm prediction part mainly utilizes a machine learning and neural network method to learn prior experience based on collected historical operation and maintenance big data, so as to generate a prediction model meeting the requirements of actual production rules. In the actual operation prediction process, real-time performance data is directly provided to the model as input, and whether a fault occurs in a later period of time is predicted through a learned rule.

The visualization display part displays all log information and prediction information. Because detailed text reports are less intuitive than charts, the invention uses charts to give operation and maintenance personnel real-time information on predicted faults and machine states as a reference, helping them efficiently understand and handle faults that may exist in the network cluster.

The operating principle of the whole system is shown in fig. 2. First, lightweight log collection components are deployed in a distributed manner on every machine in the cluster, and all required system log and application log information on the machines is sent over the network to a dedicated storage server. A log storage tool deployed on that server receives the log information of all machines in the cluster and stores and processes it centrally. Key performance index information for each machine is then extracted from the logs on the storage server by methods such as filtering and keyword extraction, forming the operation and maintenance big data of all machines. Next, on another server, the constructed neural network algorithm is trained on the collected historical operation and maintenance big data to generate an artificial intelligence model suited to the deployment. The operation and maintenance data collected in real time is processed and fed to the model as input, so that the system predicts in real time whether a machine in the cluster will fail in the next period of time, completing fault prediction and location based on operation and maintenance big data. Finally, a visual operation and maintenance data display platform is built with a front-end framework; it monitors the cluster state in real time, raises early warnings for devices with predicted faults, and caches the predicted fault information for later processing. Operation and maintenance personnel can thus understand the cluster state efficiently through the charts, use the detailed log information to pinpoint the fault, and design a solution in advance.

2. Data collection part

As shown in fig. 3, the data collection part mainly includes a log collection module and a performance index extraction module, and this part mainly completes the tasks of collecting monitoring information and extracting key performance indexes in the cluster.

A log collection module: completes the tasks of log redirection and storage. For log collection, the invention draws on existing, mature cluster log monitoring and management frameworks that many companies already use in their underlying operation and maintenance work. It mainly uses the log collection tool and the log storage tool of the mainstream ELK framework to redirect logs for centralized processing, so that the log information in the cluster is stored and processed centrally, which facilitates unified management.

A performance index extraction module: extracts the indices in the logs that reflect machine performance. In this scheme, the CPU utilization rate, memory utilization rate, disk read speed and network traffic bandwidth of each machine are mainly collected to measure the state of a machine. These key performance indices are extracted from the machine's system log information by techniques such as keyword extraction and filtering, so that they reflect the state of each machine in the cluster, and the operation and maintenance big data is constructed by a time sequence extraction method.
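The filtering and keyword extraction step can be illustrated with a minimal sketch. The patent does not specify a log layout, so the `key=value` line format, the field names (`ts`, `host`, `cpu_util`, `mem_util`, `disk_read_bw`, `net_flow`) and the function names below are all hypothetical assumptions chosen for illustration only.

```python
import re
from typing import Dict, List, Optional

# Hypothetical metric keywords; the patent names the four indices but not their log keys.
METRIC_KEYS = ("cpu_util", "mem_util", "disk_read_bw", "net_flow")
PAIR_RE = re.compile(r"(\w+)=(\S+)")


def extract_metrics(line: str) -> Optional[dict]:
    """Return the performance indices found in one log line, or None if it is filtered out."""
    fields = dict(PAIR_RE.findall(line))
    # Filtering: keep only lines that carry a timestamp, a host and every metric keyword.
    if "ts" not in fields or "host" not in fields:
        return None
    if not all(key in fields for key in METRIC_KEYS):
        return None
    try:
        record = {"ts": int(fields["ts"]), "host": fields["host"]}
        for key in METRIC_KEYS:
            record[key] = float(fields[key])
    except ValueError:
        return None  # malformed number: drop the line
    return record


def build_series(lines: List[str]) -> Dict[str, List[dict]]:
    """Group the extracted records per host and order them by timestamp."""
    series: Dict[str, List[dict]] = {}
    for line in lines:
        record = extract_metrics(line)
        if record is not None:
            series.setdefault(record["host"], []).append(record)
    for records in series.values():
        records.sort(key=lambda r: r["ts"])
    return series
```

Lines that lack any of the expected keywords are simply skipped, which is one plausible reading of "filtering"; a real deployment would adapt the pattern to the actual log format produced by the collection tool.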

3. Algorithm prediction part

As shown in fig. 4, the algorithm prediction part adaptively selects an appropriate artificial intelligence algorithm according to the kinds and statistical characteristics of the collected operation and maintenance big data, thereby completing the prediction work for different types of data. The part consists of an algorithm self-adaptive module and a fault prediction module.

An algorithm self-adaptive module: screens suitable algorithms to learn the prior experience contained in the operation and maintenance big data. As the cluster system is continually iterated and redesigned, the statistical characteristics and distribution of the operation and maintenance data may change greatly. If a model from a previous version were still used for prediction, accuracy would drop sharply and a large number of false positives would result. To avoid this, the invention trains the model by means of a statistical feature screening algorithm and a cross-validation screening algorithm. Specifically, the model is retrained every day on the historical operation and maintenance big data of the previous day: that data is collected as training data, and some of its statistical indices are calculated in order to select suitable candidate algorithms. The training data is then processed into time sequences and divided into a training set and a test set; each candidate algorithm is configured with suitable parameters and trained on the training set, cross-validation is then performed with the test set, and the model with the best result is stored persistently as the artificial intelligence model for predicting faults on the current day.

A fault prediction module: performs fault prediction on the operation and maintenance data acquired in real time by using the persisted model. Once the preceding module has generated the prediction model, real-time machine performance data is processed into operation and maintenance data and provided to the model as input. The model outputs whether a machine is likely to fail within a certain period of time and what kind of fault is likely to occur, thereby completing the tasks of network fault location and prediction.

4. Visualization display part

As shown in fig. 5, the visualization display portion mainly includes a real-time monitoring module, a history prediction module, a log information retrieval module, and a machine performance curve display module.

A real-time monitoring module: displays the real-time performance state of the machines in the cluster, mainly the real-time CPU utilization rate, memory utilization rate, disk throughput rate and network bandwidth of each machine, together with whether a fault is predicted for each machine in the next time period. It also judges whether each machine is normal from information collected by methods such as thresholds and SNMP.

A history prediction module: mainly caches the historical prediction information of each machine. Professional operation and maintenance personnel who were interviewed suggested that the history of every prediction be saved. Once a fault occurs, this prediction information is important to operation and maintenance personnel and helps them understand the fault promptly.

A log information retrieval module: mainly provides the extracted machine performance log information. This information is the main basis on which operation and maintenance personnel currently work, and it is provided so that they can continue to work in the way they are familiar with. When they receive predicted fault information, they can quickly check the machine's current log information, and use it together with their operation and maintenance knowledge to confirm the exact cause of the fault more accurately and prepare a corresponding solution.

The machine performance curve display module: the module is used for displaying the log information to the operation and maintenance personnel in a graph mode. Therefore, operation and maintenance personnel can intuitively find whether the performance curve is abnormal or not so as to quickly determine the fault, and meanwhile, the real-time prediction result is displayed on the performance curve, so that important auxiliary information is provided for the operation and maintenance personnel.
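As an illustration of how the real-time monitoring module might combine a threshold check with the prediction result, the following is a minimal sketch. The threshold values, the metric keys, the three state names and the function `machine_status` are hypothetical assumptions; the patent does not define them, and the SNMP collection path is not modelled here.

```python
# Hypothetical limits in percent; the patent only says that thresholds are used.
THRESHOLDS = {"cpu_util": 90.0, "mem_util": 90.0}


def machine_status(metrics: dict, predicted_fault: bool, thresholds: dict = THRESHOLDS) -> dict:
    """Summarise one machine for the dashboard from live metrics and the model output."""
    # Names of all metrics whose current value is above its limit.
    exceeded = sorted(
        key for key, limit in thresholds.items() if metrics.get(key, 0.0) > limit
    )
    if exceeded:
        state = "abnormal"   # a threshold is violated right now
    elif predicted_fault:
        state = "warning"    # currently normal, but a fault is predicted
    else:
        state = "normal"
    return {"state": state, "exceeded": exceeded, "predicted_fault": bool(predicted_fault)}
```

Keeping the threshold result and the prediction as separate fields lets a front end show both the present state and the early warning, which matches the module's description of displaying them side by side.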

Based on the system, the invention provides a fault location visualization method based on operation and maintenance big data prediction, which comprises the following specific implementation steps:

First, the user installs a lightweight log collection tool on the machines in each network cluster and keeps it running continuously. By modifying its configuration file, the log collection tool collects and redirects the system log information and the designated application performance log information on each machine, so that the relevant information of all machines in the cluster is gathered to monitor their performance. Compared with other tools, this log collection tool is simpler and more convenient, and its own overhead is almost negligible, so deploying it widely in a cluster does not add excessive overhead and the performance of the original network is preserved. All log information is redirected to a server in the cluster on which a log storage tool is deployed, where it is stored for subsequent analysis.

Second, the distributed log collection tool redirects the relevant logs of all machines in the cluster to the log storage server, which can also use a message queue tool for message caching, so that all log information is stored centrally on that server. Then, to monitor the performance of the machines in the cluster, indices that reflect machine performance, such as CPU utilization rate, memory utilization rate, disk throughput rate and network bandwidth traffic, are extracted from the system logs by filtering or keyword extraction. With these indices, the state of the machines in the cluster can be monitored in real time, and a foundation is laid for the subsequent prediction work.

Third, after the performance index data of all machines has been collected, it is processed into time sequence data. According to the initial settings, time sequences are constructed with a window size of 100 seconds and an offset of 1 second between two consecutive time sequences. The label of each time sequence indicates whether a fault occurs within 1000 seconds after the time window: 1 if a fault occurs and 0 otherwise. For example, let sₙ denote the machine's data at the n-th second, and suppose we have the data s₁, s₂, s₃, …, s₁₀₁ and a fault occurs within the following 1000 seconds. We then construct two time windows, one containing s₁, s₂, s₃, …, s₁₀₀ and the other containing s₂, s₃, s₄, …, s₁₀₁, and both time sequences have the label 1. After all the data has been constructed into time sequence data and stored, statistics are computed over every time window and used as new features, generating the operation and maintenance big data.
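The window construction described above can be sketched as follows. The window size (100 s), offset (1 s) and look-ahead period (1000 s) come from the text; the function names, the representation of faults as a list of 1-based seconds, and the particular statistics chosen as features (mean, standard deviation, minimum, maximum) are assumptions, since the patent does not say which statistics are computed.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

WINDOW = 100    # window size in seconds (from the text)
STRIDE = 1      # offset between consecutive windows in seconds (from the text)
HORIZON = 1000  # look-ahead period for the label in seconds (from the text)


def build_windows(
    values: Sequence[float],
    fault_times: Sequence[int],
    window: int = WINDOW,
    stride: int = STRIDE,
    horizon: int = HORIZON,
) -> List[Tuple[List[float], int]]:
    """values[i] is the sample at second i + 1; fault_times holds 1-based seconds."""
    samples = []
    for start in range(0, len(values) - window + 1, stride):
        chunk = list(values[start:start + window])
        end_second = start + window  # 1-based second of the last sample in the window
        # Label 1 if any fault falls within `horizon` seconds after the window ends.
        label = int(any(end_second < t <= end_second + horizon for t in fault_times))
        samples.append((chunk, label))
    return samples


def window_features(chunk: Sequence[float]) -> List[float]:
    """Assumed per-window statistics used as additional features."""
    return [mean(chunk), pstdev(chunk), min(chunk), max(chunk)]
```

With 101 seconds of data and a fault at second 500, `build_windows` yields exactly the two windows of the example in the text, covering seconds 1 to 100 and 2 to 101, both labelled 1.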

Fourth, an algorithm model is trained to complete the task of fault prediction. To guarantee timeliness and accuracy, all operation and maintenance data of the previous day is divided into a training set and a test set, which are used to train a suitable model for predicting network faults on the following day; the division ratio of the training set to the test set defaults to 4:1. Then, to complete the adaptive algorithm screening, each algorithm model with preset parameters is trained and cross-validated with the resulting training set and test set, and the model with the best cross-validation result is taken as the prediction model used in actual production and stored persistently, so that faults can be predicted and located in real time in the next step.
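The 4:1 split and the selection of the best candidate can be sketched as below. This is only an illustration under stated assumptions: the two toy candidate models, the accuracy metric and the chronological split are hypothetical stand-ins, and a single hold-out evaluation on the test set replaces the cross-validation step, whose exact form the patent does not describe.

```python
from statistics import mean


def split_4_to_1(samples):
    """Split samples into training and test sets at the default 4:1 ratio."""
    cut = len(samples) * 4 // 5
    return samples[:cut], samples[cut:]


class MajorityModel:
    """Toy candidate: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = int(sum(y) * 2 >= len(y))
        return self

    def predict(self, x):
        return self.label


class MeanThresholdModel:
    """Toy candidate with a preset parameter: fault if the window mean exceeds it."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # the parameter is preset, so there is nothing to learn

    def predict(self, x):
        return int(mean(x) > self.threshold)


def accuracy(model, samples):
    hits = sum(model.predict(x) == y for x, y in samples)
    return hits / len(samples)


def select_model(samples, candidates):
    """Train every candidate on the training set and keep the best one on the test set."""
    train, test = split_4_to_1(samples)
    X = [x for x, _ in train]
    y = [label for _, label in train]
    scored = [(accuracy(model.fit(X, y), test), name, model)
              for name, model in candidates.items()]
    best_score, best_name, best_model = max(scored, key=lambda item: item[0])
    return best_name, best_score, best_model
```

In a daily retraining job, `samples` would be the labelled windows built from the previous day's data, and the selected model would then be persisted (for example with `pickle`) for use by the real-time prediction step.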

Fifth, in the actual production environment, to achieve real-time fault prediction, the current real-time data of a machine and all of its data from the preceding 99 seconds are processed into a time sequence. The persisted prediction model is then called with the processed data as input, and its output is whether a fault will occur on that machine in the cluster within the 1000 seconds following the prediction. This completes real-time fault prediction and location and helps operation and maintenance personnel confirm potential faults in advance.
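One way to keep the latest sample together with the preceding 99 seconds is a fixed-length buffer per machine, sketched below. The class names and the `predict` interface are hypothetical, and `StubModel` merely stands in for the persisted prediction model so that the example is self-contained.

```python
from collections import deque


class StubModel:
    """Placeholder for the persisted model: predicts a fault if any sample exceeds a limit."""

    def __init__(self, limit: float):
        self.limit = limit

    def predict(self, window):
        return int(max(window) > self.limit)


class RealTimePredictor:
    """Keeps the most recent `window` per-second samples of one machine."""

    def __init__(self, model, window: int = 100):
        self.model = model
        self.buffer = deque(maxlen=window)  # old samples drop out automatically

    def push(self, value: float):
        """Add the newest sample; return a prediction once a full window is available."""
        self.buffer.append(value)
        if len(self.buffer) < self.buffer.maxlen:
            return None  # fewer than 100 seconds of history so far
        return self.model.predict(list(self.buffer))
```

Each call to `push` after the buffer fills produces one prediction per second, covering the current sample and the 99 samples before it; the returned value is interpreted as whether a fault is expected within the following 1000 seconds.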

Sixth, once fault prediction is complete and a prediction result has been generated, the prediction result is combined with the actual performance information for visual display, helping operation and maintenance personnel understand the condition of the machines in the cluster intuitively and simply. First, the real-time performance state indices (CPU utilization rate, memory utilization rate and the like) and the real-time fault prediction result of each machine in the cluster are displayed; these indices mainly serve to monitor the machines in real time. If an operation and maintenance worker sees that a fault is predicted for a machine, the worker can click the machine name to view detailed machine information, which comprises all performance log information and the performance curves of the machine. The performance curves let the worker spot problems visually and are annotated with the predicted fault information for reference. After identifying the point in time at which a particular performance fault may occur, the worker can review the corresponding performance log information to understand the fault in advance and prepare a corresponding solution. Finally, so that operation and maintenance personnel can understand faults more comprehensively, the history of predicted faults is retained for them to analyze.

Claims

1. A fault location visualization system based on operation and maintenance data prediction, characterized by comprising a data collection part, an algorithm prediction part and a visualization display part; wherein:

a visualization display part: used for displaying all log information and fault prediction information.

2. The fault localization visualization system according to claim 1, wherein the data collection part comprises a log collection module and a performance index extraction module; wherein:

a log collection module: comprising a distributed log collection component and a centralized log storage component, which are used for collecting the machine and application logs of all machines in the cluster and completing log redirection and centralized processing;

3. The fault location visualization system according to claim 1, wherein in the performance index extraction module, index information reflecting machine performance is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.

4. The fault localization visualization system of claim 1 wherein the algorithmic prediction portion comprises an algorithmic adaptation module and a predictive fault module; wherein:

5. The fault location visualization system according to claim 1, wherein the visualization part comprises a real-time monitoring module, a history prediction module, a log information retrieval module and a machine performance curve display module; wherein:

6. A fault localization visualization method based on the system of claim 1, characterized by comprising the following specific steps:

(3) displaying all log information and fault prediction information.

7. The fault location visualization method according to claim 6, wherein in the step (1), index information reflecting the performance of the machine is extracted by filtering and keyword extraction methods; the index information of the machine performance comprises CPU utilization rate information, memory utilization rate information, disk reading bandwidth information and network flow information.

8. The fault location visualization method according to claim 6, wherein in the step (2), the model is trained through a statistical feature screening algorithm and a cross-validation screening algorithm based on historical operation and maintenance big data of the previous day, so as to obtain an artificial intelligence model of the predicted fault of the current day.

9. The fault localization visualization method according to claim 6, wherein in the step (3), the information is displayed in the form of graphs and tables.