CN110955586A

CN110955586A - System fault prediction method, device and equipment based on log

Info

Publication number: CN110955586A
Application number: CN201911181749.XA
Authority: CN
Inventors: 代朝
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2020-04-03

Abstract

The invention provides a log-based system fault prediction method, device and equipment. The method comprises the following steps: acquiring system log data according to a preset data capture rule; classifying the acquired log data into abnormal data, performance data and service data; and analyzing the change trends of the abnormal data, the performance data and the service data based on a preset artificial intelligence model and outputting an analysis result. In this way, fault prediction of the system is achieved and the operation and maintenance cost of the system is reduced.

Description

System fault prediction method, device and equipment based on log

Technical Field

The invention relates to the technical field of data processing, in particular to a system fault prediction method, device and equipment based on logs.

Background

In the prior art, when a computer system or another system fails, log analysis is performed on that system and the type of fault is determined from the analysis results; the logs store the historical operating data of the system.

In the prior art, therefore, log analysis is passive: logs are analyzed only after problems occur in production, system parameters and the deployment strategy are then adjusted according to the analysis results, and problems are solved only as they arise. As a result, the operation and maintenance cost of the system is high.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, and a device for predicting a system fault based on a log, so as to implement fault prediction of a system.

In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:

A log-based system fault prediction method comprises the following steps:

acquiring system log data according to a preset data capturing rule;

performing data classification on the acquired log data to obtain abnormal data, performance data and service data;

and analyzing the change trends of the abnormal data, the performance data and the service data based on a preset artificial intelligence model, and outputting an analysis result.

Optionally, in the log-based system fault prediction method, analyzing the change trend of the abnormal data based on a preset artificial intelligence model includes:

analyzing the increment of the abnormal data based on a preset artificial intelligence model to obtain the moment at which the increment of the abnormal data reaches a preset warning value, wherein the increment of the abnormal data includes, but is not limited to, the occurrence frequency of the abnormal data within a preset time period and the difference between the value of the abnormal data and the value of normal data.

Optionally, in the log-based system fault prediction method, analyzing the variation trend of the performance data based on a preset artificial intelligence model includes:

analyzing the variation trend of the performance data of the system based on a preset artificial intelligence model to obtain the time at which the performance data reach a preset performance threshold, wherein the performance data include, but are not limited to, system memory usage and CPU utilization.

Optionally, in the log-based system fault prediction method, analyzing the change trend of the service data based on a preset artificial intelligence model includes:

analyzing the change trend of the service data of the system based on a preset artificial intelligence model to obtain the change trend of the data volume of each type of service data.

Optionally, in the log-based system fault prediction method, the obtained performance data are compared with a preset performance threshold, and when the performance data reach the threshold, an expansion request is output to the upper-level system to increase the resources of the system.

A log-based system failure prediction apparatus, comprising:

the log data capturing unit is used for acquiring system log data according to a preset data capturing rule;

the data classification unit is used for performing data classification on the acquired log data to obtain abnormal data, performance data and service data;

and the data analysis unit is used for analyzing the change trends of the abnormal data, the performance data and the service data based on a preset artificial intelligence model and outputting an analysis result.

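Optionally, in the log-based system failure prediction apparatus, when the data analysis unit analyzes the variation trend of the abnormal data based on a preset artificial intelligence model, the data analysis unit is specifically configured to:

analyze the increment of the abnormal data based on the preset artificial intelligence model to obtain the moment at which the increment of the abnormal data reaches a preset warning value, wherein the increment of the abnormal data includes, but is not limited to, the occurrence frequency of the abnormal data within a preset time period and the difference between the value of the abnormal data and the value of normal data.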

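Optionally, in the log-based system failure prediction apparatus, when the data analysis unit analyzes the variation trend of the performance data based on a preset artificial intelligence model, the data analysis unit is specifically configured to:

analyze the variation trend of the performance data of the system based on the preset artificial intelligence model to obtain the time at which the performance data reach a preset performance threshold, wherein the performance data include, but are not limited to, system memory usage and CPU utilization.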

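Optionally, in the log-based system failure prediction apparatus, when the data analysis unit analyzes the change trend of the service data based on a preset artificial intelligence model, the data analysis unit is specifically configured to:

analyze the change trend of the service data of the system based on the preset artificial intelligence model to obtain the change trend of the data volume of each type of service data.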

Optionally, in the log-based system failure prediction apparatus, the data analysis unit is further configured to compare the acquired performance data with a preset performance threshold and, when the performance data reach the threshold, output an expansion request to the upper-level system so as to increase the resources of the system.

A log-based system failure prediction device, comprising:

a memory and a processor;

the memory is configured to store program code, and the processor is configured to invoke the program code and, when executed, implement any of the log-based system failure prediction methods described above.

Based on the above technical solution, the embodiments of the present invention use a preset artificial intelligence model to predict how the abnormal data, the performance data and the service data will change over a future period. This provides early warning of the operating condition of the system, allows operation and maintenance personnel to manage and maintain the system in a targeted manner according to the predicted trends, and thereby reduces the operation and maintenance cost of the system.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a log-based system failure prediction method disclosed in an embodiment of the present application;

FIG. 2 is a schematic flow chart of predicting abnormal data with the log-based system failure prediction method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of predicting performance data with the log-based system failure prediction method according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of predicting service data with the log-based system failure prediction method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a log-based system failure prediction apparatus disclosed in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a log-based system failure prediction device disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

To address the prior-art problem that log data are analyzed only after a system has failed, which causes significant losses to production, the present application discloses a log-based system failure prediction method comprising the following steps:

step S101: acquiring system log data according to a preset data capturing rule;

In this scheme, the system stores multiple log files, each holding the log information of a different module of the system. A data capture rule is configured in advance for each log file, and log data are captured from each file according to its preset rule. The log files include, but are not limited to, middleware logs (Apache, JBoss, etc.), application logs, system operation indicators and system operation logs;
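The per-file capture rule of step S101 can be pictured as a lookup table from log file to parsing pattern. The following is a minimal sketch, assuming each rule is a regular expression; the file names `middleware.log` and `app.log`, their line layouts, and the `capture` helper are hypothetical and do not come from the patent.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical capture rules: one precompiled pattern per log file.
# File names and line layouts are illustrative assumptions, not from the patent.
CAPTURE_RULES: Dict[str, re.Pattern] = {
    "middleware.log": re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<msg>.*)$"),
    "app.log": re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+): (?P<msg>.*)$"),
}


@dataclass
class LogRecord:
    source: str
    ts: str
    level: str
    msg: str


def capture(source: str, lines: Iterable[str]) -> List[LogRecord]:
    """Capture records from one log file using the rule preconfigured for it."""
    rule = CAPTURE_RULES.get(source)
    if rule is None:
        return []  # no rule configured for this file, so nothing is captured
    records = []
    for line in lines:
        m = rule.match(line.rstrip("\n"))
        if m:  # lines that do not match the rule are skipped
            records.append(LogRecord(source, m["ts"], m["level"], m["msg"]))
    return records
```

Keeping one rule per file mirrors the text: adding a new module's log only means registering another pattern, without touching the capture loop.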

step S102: performing data classification on the acquired log data to obtain abnormal data, performance data and service data;

In this step, a data processing module filters and processes the collected log data, and the processed log data are divided into three parts: abnormal data, performance data, and service data;

In addition to the abnormal data themselves, the abnormal data may include one or more of the following: the time at which the abnormality occurred, the abnormality content, the abnormal operation, the abnormality level, and whether the abnormality is recoverable;

In addition to the system performance data themselves, the performance data may include the response time of requests received by the system and other data characterizing the responsiveness of the system;

In addition to the service data processed by the system, the service data may include information such as the request time distribution and request frequency of each type of service data.

To facilitate data management, the technical solution disclosed in the embodiments of the present application further configures three data tables, one each for the abnormal data, the performance data and the service data, and each type of data may be stored in its corresponding table.
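Step S102 and the one-to-one data tables can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the patent does not say how records are classified, so the routing rule below (error levels are abnormal, lines carrying the hypothetical markers `cpu=`, `mem=` or `resp_ms=` are performance, everything else is service) and the table names are invented for the example.

```python
import sqlite3
from typing import Iterable, Tuple

# One table per data category, mirroring the one-to-one tables described above.
TABLES = ("abnormal_data", "performance_data", "service_data")
# Hypothetical markers that identify a performance sample in a log message.
PERF_MARKERS = ("cpu=", "mem=", "resp_ms=")


def init_tables(conn: sqlite3.Connection) -> None:
    for table in TABLES:
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (ts TEXT, level TEXT, msg TEXT)")


def classify(level: str, msg: str) -> str:
    """Assumed routing rule: errors are abnormal, metric lines are performance, rest is service."""
    if level in ("ERROR", "FATAL") or "Exception" in msg:
        return "abnormal_data"
    if any(marker in msg for marker in PERF_MARKERS):
        return "performance_data"
    return "service_data"


def store(conn: sqlite3.Connection, records: Iterable[Tuple[str, str, str]]) -> None:
    for ts, level, msg in records:
        table = classify(level, msg)  # always one of TABLES, so the f-string is safe
        conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (ts, level, msg))
    conn.commit()
```

The table name is interpolated only because it always comes from the fixed `TABLES` whitelist; the record values themselves go through `?` placeholders.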

Step S103: analyzing the change trends of the abnormal data, the performance data and the service data based on a preset artificial intelligence model, and outputting an analysis result;

The artificial intelligence model is a data processing and prediction technique commonly used in the prior art: a basic model is constructed and then trained with training data. The basic model preloaded into each artificial intelligence model differs with the type of data to be processed, but the training process is essentially the same. Before the scheme is executed, one artificial intelligence model may be constructed in advance for each of the abnormal data, the performance data and the service data. Once these data are obtained, each is loaded into its corresponding preset model, which predicts the change trend of that data over a future period. This provides early warning of the operating condition of the system, allows operation and maintenance personnel to manage and maintain the system in a targeted manner according to the predicted trends, and reduces the operation and maintenance cost of the system.

Further, in the technical solution disclosed in the embodiments of the present application, different types of abnormal data are monitored against different criteria. Generally, monitoring of abnormal data falls into two categories: monitoring the magnitude of the abnormal value and monitoring the frequency at which the abnormality occurs. Therefore, before the abnormal data are analyzed with the preset artificial intelligence model, they may be grouped by monitoring object so that the model analyzes only one type of abnormal data at a time. The output of the model may include the growth rate of the abnormal values in the current input and the occurrence frequency of the abnormal data.
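The one-model-per-category arrangement can be sketched as a small registry. The patent does not name a model type, so this sketch substitutes an ordinary least squares trend line as a stand-in "basic model"; `LinearTrendModel`, `MODELS` and `predict_trend` are hypothetical names. For brevity the sketch refits on each call, whereas the text describes models trained in advance.

```python
from typing import Dict, List, Sequence


class LinearTrendModel:
    """Stand-in 'basic model': ordinary least squares line y = a + b * t."""

    def __init__(self) -> None:
        self.a = 0.0
        self.b = 0.0

    def fit(self, ys: Sequence[float]) -> "LinearTrendModel":
        n = len(ys)
        if n < 2:
            raise ValueError("need at least two points to fit a trend")
        t_mean = (n - 1) / 2
        y_mean = sum(ys) / n
        sxx = sum((t - t_mean) ** 2 for t in range(n))
        sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
        self.b = sxy / sxx
        self.a = y_mean - self.b * t_mean
        return self

    def forecast(self, start: int, horizon: int) -> List[float]:
        # Extrapolate the fitted line over the next `horizon` time steps.
        return [self.a + self.b * t for t in range(start, start + horizon)]


# One independently held model per data category.
MODELS: Dict[str, LinearTrendModel] = {
    "abnormal": LinearTrendModel(),
    "performance": LinearTrendModel(),
    "service": LinearTrendModel(),
}


def predict_trend(category: str, history: Sequence[float], horizon: int) -> List[float]:
    """Load one category's history into its own model and forecast its trend."""
    model = MODELS[category].fit(history)
    return model.forecast(len(history), horizon)
```

Any model exposing the same `fit`/`forecast` pair could be swapped into the registry per category, which matches the statement that the basic models differ while the procedure stays the same.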
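Grouping abnormal data by monitoring object before analysis, and reporting an occurrence count and a value growth rate per group, might look like the sketch below. The event tuple layout `(timestamp, type, value)` and the definition of growth rate as the relative change from the first to the last value in the period are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, float]  # (timestamp in seconds, abnormality type, observed value)


def summarize_by_type(events: Iterable[Event], start: float, end: float) -> Dict[str, dict]:
    """Group abnormal events by type within [start, end) and summarize each group separately."""
    groups: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for ts, etype, value in events:
        if start <= ts < end:
            groups[etype].append((ts, value))
    summary = {}
    for etype, items in groups.items():
        items.sort()  # chronological order
        first, last = items[0][1], items[-1][1]
        if first == 0:
            growth = 0.0 if last == 0 else float("inf")
        else:
            growth = (last - first) / first  # relative growth of the abnormal value
        summary[etype] = {"occurrences": len(items), "growth_rate": growth}
    return summary
```

Each group is summarized independently, so a model downstream only ever sees one type of abnormal data at a time.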

Referring to fig. 2, in the above method, analyzing the variation trend of the abnormal data based on a preset artificial intelligence model may include:

step S1031: analyzing the increment of the loaded abnormal data based on the preset artificial intelligence model, and predicting an increment trend curve of the abnormal data over a preset future period;

step S1032: acquiring, from the increment trend curve of the abnormal data, the moment at which the increment of the abnormal data reaches a preset warning value.

The increment of the abnormal data includes, but is not limited to, the occurrence frequency of the abnormal data (that is, the number of occurrences within a preset time period) and the difference between the value of the abnormal data and the value of normal data. In this scheme, the preset artificial intelligence model predicts both the growth trend of the occurrence frequency of the loaded abnormal data and the change trend of its magnitude over a future period. From these, it predicts when the value of the abnormal data will reach the preset warning value and when its occurrence frequency will reach the preset warning frequency. System faults are thus predicted in advance, which can effectively prevent the system from going down because of a fault.

In this scheme, as the amount of data processed by the system grows, the resource requirements of the system, such as memory and CPU, also grow. A fault here therefore refers not only to a system fault caused by a data processing error during computer data processing, but also to a fault caused by insufficient data processing capacity, for example a response timeout of certain data or requests because the system processes data too slowly. Accordingly, the system performance may be monitored to predict how the data processing speed and response speed of the system will change. In this case, referring to FIG. 3, analyzing the change trend of the performance data based on the preset artificial intelligence model may include:
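Steps S1031 and S1032 for the frequency measure can be sketched as: count abnormal occurrences per window, extrapolate the counts, and report the first future window whose predicted count reaches the warning value. The patent leaves the model unspecified; this sketch assumes a least squares linear trend as a stand-in, and the helper names are hypothetical. The value-difference measure would follow the same pattern with deviations in place of counts.

```python
from typing import List, Optional, Sequence, Tuple


def counts_per_window(timestamps: Sequence[float], window_s: float, n_windows: int) -> List[int]:
    """Occurrences of abnormal data in each consecutive window starting at t=0."""
    counts = [0] * n_windows
    for ts in timestamps:
        idx = int(ts // window_s)
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit y = intercept + slope * t for t = 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two windows to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def warning_moment(counts: Sequence[int], window_s: float,
                   warning_value: float, horizon: int) -> Optional[float]:
    """S1031/S1032: extrapolate the count trend and return the start time (s) of the
    first future window whose predicted count reaches warning_value, or None."""
    intercept, slope = fit_line(counts)
    for t in range(len(counts), len(counts) + horizon):
        if intercept + slope * t >= warning_value:
            return t * window_s
    return None
```

For example, counts of 1, 2, 3 and 4 in four 60-second windows extrapolate to 7 occurrences in window 6, so a warning value of 7 yields the moment 360 s.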

step S1033: analyzing the change trend of the performance data of the system based on a preset artificial intelligence model, and predicting to obtain a change trend curve of the performance data in a future preset time;

step S1034: acquiring, from the change trend curve of the performance data, the time at which the performance data reach a preset performance threshold.

The performance data include, but are not limited to, system memory usage and CPU utilization. Accordingly, the performance data also need to be classified in advance when they are predicted, and only one type of performance data is loaded into the preset artificial intelligence model at a time. For example, performance data representing the memory usage of the system are loaded into the preset model, which predicts the change trend of memory usage; the resulting growth trend is used to obtain the time at which memory usage will reach its preset threshold. Likewise, performance data representing the CPU utilization of the system are loaded into the preset model, which predicts the change trend of CPU utilization; the resulting growth trend is used to obtain the time at which CPU utilization will reach its preset threshold.

With this scheme, the trend of user demand can be predicted from the change trend of the service data, so that the system can allocate its resources reasonably. Similarly, the service data also need to be classified before their change trend is predicted, and only one type of service data is loaded into the preset artificial intelligence model at a time. Analyzing the change trend of the service data based on the preset artificial intelligence model includes: analyzing the change trend of the service data of the system to obtain the change trend of the data volume of each type of service data. The change trend of the data volume indicates how user demand for each service will change over a future period, and the system can adjust the resources occupied by each service accordingly.
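Forecasting each performance metric separately and estimating how many sampling intervals remain before it reaches its threshold can be sketched with Holt's linear (double) exponential smoothing. This is a substitute technique chosen for illustration, not the model of the patent; the smoothing factors, the metric names and the percentage units are assumptions.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def holt(series: Sequence[float], alpha: float, beta: float) -> Tuple[float, float]:
    """Holt's linear (double) exponential smoothing; returns the final (level, trend)."""
    if len(series) < 2:
        raise ValueError("need at least two samples")
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level, trend


def steps_to_threshold(series: Sequence[float], threshold: float,
                       alpha: float = 0.5, beta: float = 0.5) -> Optional[int]:
    """Sampling intervals until the forecast reaches threshold; 0 if already reached,
    None if the smoothed trend is flat or falling."""
    level, trend = holt(series, alpha, beta)
    if level >= threshold:
        return 0
    if trend <= 0:
        return None
    return math.ceil((threshold - level) / trend)


def predict_all(metrics: Dict[str, Sequence[float]],
                thresholds: Dict[str, float]) -> Dict[str, Optional[int]]:
    # Each metric type (e.g. memory usage, CPU utilization) is forecast separately.
    return {name: steps_to_threshold(values, thresholds[name]) for name, values in metrics.items()}
```

Multiplying the returned step count by the sampling interval gives the predicted time at which the metric reaches its threshold.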
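The idea of adjusting each service's resources from its predicted data volume can be illustrated with a naive drift forecast per service and a proportional split of capacity. Both the forecasting method and the proportional allocation policy are assumptions for this sketch; the patent only states that resources are adjusted according to the trend.

```python
from typing import Dict, Sequence


def forecast_next(volumes: Sequence[float]) -> float:
    """Naive drift forecast: last volume plus the mean period-over-period change."""
    if not volumes:
        raise ValueError("need at least one period of history")
    if len(volumes) == 1:
        return float(volumes[0])
    deltas = [b - a for a, b in zip(volumes, volumes[1:])]
    return max(0.0, volumes[-1] + sum(deltas) / len(deltas))


def allocate(history: Dict[str, Sequence[float]], total_units: float) -> Dict[str, float]:
    """Split total_units across services in proportion to forecast volume (assumed policy)."""
    forecasts = {svc: forecast_next(v) for svc, v in history.items()}
    total = sum(forecasts.values())
    if total == 0:
        return {svc: total_units / len(forecasts) for svc in forecasts}
    return {svc: total_units * f / total for svc, f in forecasts.items()}
```

A growing service receives a larger share in the next period even before its load actually arrives, which is the point of forecasting rather than reacting.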

Further, in the technical solution disclosed in the embodiments of the present application, when the performance data are analyzed and the performance data of the system are found to have reached the preset performance threshold, an expansion request may be sent to the upper-level system to increase the resources of the system to which the method is applied. That is, referring to FIG. 4, the method may further include:

step S104: comparing the acquired performance data with a preset performance threshold value, and judging whether the performance data is greater than the preset performance threshold value; if yes, go to step S105;

step S105: when the performance data reaches the preset performance threshold, outputting an expansion request to an upper-level system so as to increase system resources of the system;

In this scheme, determining whether the performance data are greater than the preset performance threshold may mean determining whether the performance data acquired within a preset time period exceed the threshold, or whether the average of the performance data acquired within that period exceeds it.

Corresponding to the above method, the present application also discloses a log-based system failure prediction apparatus. The apparatus provided by the embodiments of the present invention is described below; for the specific operation of each of its units, refer to the method embodiment above, and the apparatus and the method described herein may be cross-referenced.
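Steps S104 and S105 can be sketched with both readings of the threshold check described above. The first reading is ambiguous in the text; this sketch assumes it means every sample in the period exceeds the threshold. The `send_request` callback and the request payload are hypothetical stand-ins for whatever channel reaches the upper-level system.

```python
from statistics import mean
from typing import Callable, Sequence


def exceeds_threshold(samples: Sequence[float], threshold: float, mode: str = "average") -> bool:
    """Step S104 under two readings of the text:
    'all'     -> every sample in the preset period exceeds the threshold (assumed reading),
    'average' -> the mean of the samples in the period exceeds the threshold."""
    if not samples:
        return False
    if mode == "all":
        return all(s > threshold for s in samples)
    if mode == "average":
        return mean(samples) > threshold
    raise ValueError(f"unknown mode: {mode}")


def check_and_expand(samples: Sequence[float], threshold: float,
                     send_request: Callable[[dict], None], mode: str = "average") -> bool:
    """Step S105: emit an expansion request through a caller-supplied (hypothetical) channel."""
    if exceeds_threshold(samples, threshold, mode):
        send_request({"type": "expansion", "metric_mean": mean(samples), "threshold": threshold})
        return True
    return False
```

The average-based reading tolerates a single low sample, while the all-samples reading only fires on sustained overload; which one suits a deployment depends on how bursty its load is.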

Referring to fig. 5, the log-based system failure prediction apparatus disclosed in the embodiment of the present application may include:

a log data capture unit 100, a data classification unit 200, and a data analysis unit 300;

corresponding to the above method, the log data capture unit 100 is configured to obtain system log data according to a preset data capture rule;

corresponding to the above method, the data classification unit 200 is configured to classify the acquired log data into abnormal data, performance data and service data. Specifically, in addition to the abnormal data themselves, the abnormal data may include one or more of the following: the time at which the abnormality occurred, the abnormality content, the abnormal operation, the abnormality level, and whether the abnormality is recoverable. In addition to the system performance data themselves, the performance data may include the response time of requests received by the system and other data characterizing the responsiveness of the system. In addition to the service data processed by the system, the service data may include information such as the request time distribution and request frequency of each type of service data. To facilitate data management, the data classification unit 200 is further configured with three data tables, one each for the abnormal data, the performance data and the service data, and each type of data may be stored in its corresponding table;

corresponding to the above method, the data analysis unit 300 is configured to analyze the variation trend of the abnormal data, the performance data, and the service data based on a preset artificial intelligence model, and output an analysis result.

In this embodiment, the data analysis unit 300 stores one artificial intelligence model for each of the abnormal data, the performance data and the service data. Once these data are obtained, each is loaded into its corresponding preset model, which predicts the change trend of that data over a future period. This provides early warning of the operating condition of the system and allows operation and maintenance personnel to manage and maintain the system in a targeted manner according to the predicted trends, thereby reducing the operation and maintenance cost of the system.

Corresponding to the above method, when the data analysis unit analyzes the variation trend of the abnormal data based on the preset artificial intelligence model, the abnormal data may be classified in advance according to their monitoring object, so that the model analyzes only one type of abnormal data at a time. The output of the model may include the growth rate of the abnormal values in the current input and the occurrence frequency of the abnormal data. Analyzing the variation trend of the abnormal data by the data analysis unit 300 may include: analyzing the increment of the loaded abnormal data based on the preset model and predicting the moment at which the increment reaches a preset warning value, wherein the increment includes, but is not limited to, the occurrence frequency of the abnormal data (the number of occurrences within a preset time period) and the difference between the value of the abnormal data and the value of normal data. The data analysis unit 300 uses the preset model to predict both the growth trend of the occurrence frequency and the change trend of the magnitude of the loaded abnormal data over a future period, and predicts when the value of the abnormal data will reach the preset warning value and when its occurrence frequency will reach the preset warning frequency. System faults are thus predicted in advance, which can effectively prevent the system from going down because of a fault.

Corresponding to the above method, when the data analysis unit analyzes the variation trend of the performance data based on a preset artificial intelligence model, the data analysis unit is specifically configured to:

analyze the variation trend of the performance data of the system based on the preset artificial intelligence model to obtain the time at which the performance data reach a preset performance threshold, wherein the performance data include, but are not limited to, system memory usage and CPU utilization. As in the method embodiment above, the data analysis unit classifies the performance data in advance and loads only one type of performance data into the preset model at a time. For example, performance data representing the memory usage of the system are loaded into the preset model, which predicts the growth trend of memory usage, from which the time at which memory usage reaches its preset threshold is obtained; likewise, performance data representing the CPU utilization of the system are loaded into the preset model, which predicts the growth trend of CPU utilization, from which the time at which CPU utilization reaches its preset threshold is obtained.

Corresponding to the above method, when the data analysis unit analyzes the change trend of the service data based on a preset artificial intelligence model, the data analysis unit is specifically configured to:

analyze the change trend of the service data of the system based on the preset artificial intelligence model to obtain the change trend of the data volume of each type of service data. The change trend of the data volume indicates how user demand for each service will change over a future period, and the system can adjust the resources occupied by each service accordingly. As in the method embodiment above, the data analysis unit may first classify the service data once they are obtained.

Further, in the above solution disclosed in the embodiments of the present application, the data analysis unit may be further configured to compare the acquired performance data with a preset performance threshold and, when the performance data reach the threshold, output an expansion request to the upper-level system so as to increase the resources of the system.

Further, referring to FIG. 5, the apparatus disclosed in the embodiments of the present application may further include a log display system configured to display the prediction results of the data analysis unit 300.

Correspondingly, the present application also discloses a log-based system failure prediction device, referring to fig. 6, the device may include:

a memory 400 and a processor 500;

the device further comprises a communication interface 600 and a communication bus 700, and the memory 400, the processor 500 and the communication interface 600 communicate with one another via the communication bus 700.

The memory 400 is used for storing program codes; the program code includes computer operational instructions.

The memory 400 may comprise high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

The processor 500 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The processor 500 is configured to call the program code and, when the program code is executed, to perform the method according to any embodiment of the present application.

For convenience of description, the above system is described as divided into functional modules, each described separately. Of course, when the invention is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware.

The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, refer to the corresponding descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without creative effort.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

It should further be noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprises," "comprising," and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

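1. A log-based system failure prediction method, characterized by comprising the following steps:

acquiring system log data according to a preset data capturing rule;

performing data classification on the acquired log data to obtain abnormal data, performance data and business data; and

analyzing the change trends of the abnormal data, the performance data and the business data based on a preset artificial intelligence model, and outputting an analysis result.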

2. The log-based system failure prediction method of claim 1, wherein analyzing the variation trend of the abnormal data based on a preset artificial intelligence model comprises:

3. The log-based system failure prediction method of claim 1, wherein analyzing the variation trend of the performance data based on a preset artificial intelligence model comprises:

analyzing the variation trend of the performance data of the system based on a preset artificial intelligence model to obtain the time at which the performance data will reach a preset performance threshold, wherein the performance data includes, but is not limited to, the system memory utilization rate and the CPU utilization rate.

4. The log-based system failure prediction method of claim 1, wherein analyzing the change trend of the business data based on a preset artificial intelligence model comprises:

5. The log-based system failure prediction method of claim 1, further comprising:

comparing the acquired performance data with a preset performance threshold, and outputting a capacity expansion request to an upper-level system when the performance data reaches the preset performance threshold.

6. A log-based system failure prediction apparatus, comprising:

7. The log-based system failure prediction device of claim 6, wherein the data analysis unit, when analyzing the variation trend of the abnormal data based on a preset artificial intelligence model, is specifically configured to:

8. The log-based system failure prediction device of claim 6, wherein the data analysis unit, when analyzing the variation trend of the performance data based on a preset artificial intelligence model, is specifically configured to:

9. The log-based system failure prediction device of claim 6, wherein the data analysis unit is further configured to:

10. A log-based system failure prediction device, comprising:

a memory and a processor;

the memory is configured to store program code, and the processor is configured to invoke the program code, which, when executed, implements the log-based system failure prediction method of any one of claims 1-5.
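Claim 1 leaves both the preset data capturing rule and the classification criteria unspecified. As a minimal sketch, assuming the capturing rule is a regular expression applied to each log line and that classification into abnormal, performance and business data is done by log level and keyword matching, the first two steps might look like the Python below. The log line format, the patterns and the names CAPTURE_RULE, LogRecord, capture and classify are all hypothetical illustrations, not taken from the patent.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical capture rule: keep only "<timestamp> <LEVEL> <message>" lines.
CAPTURE_RULE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

# Hypothetical classification patterns (assumed log conventions).
PERF_PATTERN = re.compile(r"\b(cpu|mem)=(\d+(?:\.\d+)?)%")
BUSINESS_PATTERN = re.compile(r"\btxn=(\w+)\b")


@dataclass
class LogRecord:
    ts: str
    level: str
    msg: str


def capture(lines: Iterable[str]) -> List[LogRecord]:
    """Acquire log data: keep only lines that match the preset capture rule."""
    records = []
    for line in lines:
        m = CAPTURE_RULE.match(line.strip())
        if m:
            records.append(LogRecord(m["ts"], m["level"], m["msg"]))
    return records


def classify(records: Iterable[LogRecord]) -> Dict[str, List[LogRecord]]:
    """Split captured records into abnormal, performance and business data."""
    parts: Dict[str, List[LogRecord]] = {
        "abnormal": [], "performance": [], "business": []
    }
    for r in records:
        if r.level in ("ERROR", "FATAL") or "Exception" in r.msg:
            parts["abnormal"].append(r)
        elif PERF_PATTERN.search(r.msg):
            parts["performance"].append(r)
        elif BUSINESS_PATTERN.search(r.msg):
            parts["business"].append(r)
        # Records matching none of the rules are captured but left unclassified.
    return parts
```

For example, feeding it one INFO line carrying "cpu=42.5%", one ERROR line containing an exception, one INFO line carrying "txn=T1001", and one line that does not match the capture rule yields three captured records, one in each category. Checking abnormality first is a design choice in this sketch: an error line that also carries metrics is treated as abnormal data rather than performance data.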
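Claims 3 and 5 do not identify the "preset artificial intelligence model". The sketch below substitutes an ordinary least-squares linear trend, a deliberately simple stand-in rather than the patent's model, to estimate when a utilization metric (for example memory or CPU utilization, in percent) will reach a preset threshold, and emits a capacity expansion request once the latest sample reaches that threshold. The function names, the ExpansionRequest structure and the example threshold of 85% are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ExpansionRequest:
    """Hypothetical payload sent to the upper-level system (claim 5)."""
    metric: str
    value: float
    threshold: float


def fit_linear_trend(times: Sequence[float],
                     values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of values ~ slope * t + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("sample times must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict_threshold_time(times: Sequence[float], values: Sequence[float],
                           threshold: float) -> Optional[float]:
    """Estimate when the fitted trend reaches `threshold` (claim 3 stand-in).

    Returns the latest sample time if the threshold is already reached, and
    None if the metric is below the threshold with a flat or falling trend.
    """
    slope, intercept = fit_linear_trend(times, values)
    if values[-1] >= threshold:
        return times[-1]
    if slope <= 0:
        return None
    # Never report a crossing time earlier than the latest observation.
    return max(times[-1], (threshold - intercept) / slope)


def check_capacity(metric: str, latest: float,
                   threshold: float) -> Optional[ExpansionRequest]:
    """Claim 5 step: compare the latest sample with the preset threshold."""
    if latest >= threshold:
        return ExpansionRequest(metric, latest, threshold)
    return None
```

For instance, memory utilization samples of 50, 55, 60 and 65 at times 0 to 3 fit a slope of 5 per time unit, so with a threshold of 85 the predicted crossing time is 7.0. A straight-line fit only makes sense when growth is roughly linear over the window considered; a production system would more likely fit on a sliding window or use a dedicated time-series model.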