CN113656287A

CN113656287A - Method and device for predicting software instance fault, electronic equipment and storage medium

Info

Publication number: CN113656287A
Application number: CN202110860029.7A
Authority: CN
Inventors: 易存道
Original assignee: Beijing Baolande Software Co ltd
Current assignee: Beijing Baolande Software Co ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2021-11-16
Anticipated expiration: 2041-07-28
Also published as: CN113656287B

Abstract

The invention provides a method and a device for predicting software instance faults, an electronic device, and a storage medium. The method acquires real-time data of an alarm index of a software instance and predicts a fault of the software instance through a fault prediction model according to that real-time data. Because the fault prediction model predicts in advance whether the software instance will fail, and at the same time predicts the likely fault points, it provides an effective basis for fault repair, saves the time spent manually troubleshooting fault causes, and improves fault repair efficiency.

Description

Method and device for predicting software instance fault, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of computer information, in particular to a method and a device for predicting software instance faults, electronic equipment and a storage medium.

Background

As the computer software industry evolves and the calling and deployment relationships of business systems grow more complex, the calling relationships among components multiply, and faults and anomalies in each business system become more frequent. Timely repair and prediction of faults in the associated instances of a business system therefore become increasingly important.

At present, existing approaches to software instance faults mainly inspect the machine after a software instance has failed, or rely on periodic manual inspection of the machine; after the system goes down, the possible causes are inferred from human experience, and the inferred causes are then checked and repaired.

Therefore, existing solutions can only repair a fault after it occurs and cannot predict where and when a fault will occur beforehand; when troubleshooting, the cause can only be judged from human experience, so resolving a software instance fault takes a long time and is inefficient.

Disclosure of Invention

The invention provides a method and a device for predicting software instance faults, an electronic device, and a storage medium, which address the following problems of existing solutions: a fault can only be repaired after it occurs, and its location and time cannot be predicted beforehand; and the fault cause can only be judged from human experience, so resolving a software instance fault takes a long time and is inefficient. The fault prediction model predicts in advance whether the software instance will fail and, at the same time, predicts the likely fault points, which provides an effective basis for fault repair, saves the time spent manually troubleshooting fault causes, and improves fault repair efficiency.

The invention provides a method for predicting software instance faults, which comprises the following steps:

acquiring real-time data of an alarm index of a software instance;

predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index;

the fault prediction model is trained based on fault indexes of software instances and alarm indexes associated with the fault indexes.

According to the method for predicting the software instance fault, provided by the invention, the fault of the software instance is predicted through a fault prediction model according to the real-time data of the alarm index, and the method comprises the following steps:

comparing the real-time data with the keyword set, and determining whether the alarm index is an abnormal alarm index; the keyword set comprises keywords extracted from abnormal historical alarm indexes of the software instance;

and if the alarm index is an abnormal alarm index, inputting the real-time data of the alarm index into the fault prediction model to predict the fault of the software instance.

According to the method for predicting software instance faults provided by the invention, before the real-time data of the alarm index is input into the fault prediction model to predict the fault of the software instance, the method comprises the following steps:

determining the occurrence time of any fault index in the historical data of the software instance;

acquiring the preamble alarm indexes within a first preset time period before the occurrence time, and acquiring the historical alarm indexes within a second preset time period before the occurrence time; the second preset time period is longer than the first preset time period;

generating an association rule of the fault index and any preamble alarm index according to the preamble alarm index in the first preset time period and the historical alarm index in the second preset time period;

and establishing a fault prediction model according to the association rule.

According to the method for predicting software instance faults, the association rule of the fault index and any preamble alarm index is generated through the preamble alarm index in the first preset time period and the historical alarm index in the second preset time period, and the method comprises the following steps:

acquiring a target alarm index associated with the fault index from the preamble alarm indexes in the first preset time period;

bucketing the historical alarm indexes in the second preset time period through a time slicing operation to generate historical alarm index buckets;

calculating the association degree between the fault index and any target alarm index based on the Apriori algorithm according to the historical alarm index buckets; wherein the association degree comprises support, confidence and lift;

and generating an association rule of the fault index and any target alarm index according to the association degree.

According to the method for predicting the software instance fault, the step of obtaining the target alarm index which is relevant to the fault index comprises the following steps:

de-duplicating the preamble alarm indexes to generate a preamble alarm index set;

and taking any preamble alarm index in the preamble alarm index set as a target alarm index associated with the fault index.

According to the method for predicting software instance faults provided by the invention, the generation of the association rule of the fault index and any preamble alarm index further comprises the following steps:

acquiring the time interval between the occurrence time of any target alarm index and the occurrence time of the fault index; wherein the time interval comprises: a maximum time interval, a minimum time interval, a median time interval;

and adding the time interval into the association rule so as to predict the occurrence time of the fault of the software instance.

According to the method for predicting the software instance fault, after the fault of the software instance is predicted, the method comprises the following steps:

if the software instance is predicted to have a fault, determining the fault type of the fault of the software instance;

according to the fault type, marking real-time data of the alarm index of the software instance;

and storing the real-time data of the alarm index and the prediction result into a relational database according to the mark to serve as a data source for displaying the prediction result of the software instance fault.

The invention also provides a device for predicting the software instance fault, which comprises:

the acquiring unit is used for acquiring real-time data of the alarm indexes of the software instances;

the prediction unit is used for predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index;

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the method for predicting the fault of the software instance.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for predicting a failure of a software instance as described in any one of the above.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flowchart of a method for predicting a software instance fault provided in an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a fault prediction method for software instance indexes based on the Apriori algorithm according to another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a device for predicting a software instance fault according to another embodiment of the present invention;

FIG. 4 is a schematic diagram of the physical structure of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

First, conventional methods for handling faults in the indexes of a software instance are described.

As services in the software industry change and the calling and deployment relationships of business systems grow more complex, the calling relationships among components become complex, and fault prediction for the associated instances of a business system becomes particularly important. Artificial intelligence is introduced to predict in advance which faults a software instance may experience, such as JVM memory overflow or an instance going down. Applying artificial intelligence to operation and maintenance gives operation and maintenance personnel sufficient time to resolve problems in advance, which improves working efficiency, lowers the probability of faults in the production environment, and reduces losses.

Existing approaches to software instance faults fall mainly into two types. In the first, the machine is checked after the software instance becomes abnormal; for example, when excessive disk usage brings the business system down, the specific cause can generally be located only after the system is already down, and the possible causes are inferred from human experience. This approach is inefficient, takes a long time to resolve the fault, and causes losses to the production environment. In the second, a large amount of historical alarm data and fault event data is collected, the fault events are subdivided into classes based on manual operation and maintenance experience, and a classification model is trained with a machine learning algorithm; when a new system alarm occurs, the trained classification model is called on the new alarm sample to predict the future fault class, where "no fault" is treated as a separate class. This approach is a great improvement over the first, but labeling the historical sample data requires a great deal of manual effort, and the quality of the manually labeled data strongly affects the accuracy of the model.

Existing technical schemes lack a complete fault prediction scheme; the cause of a problem can only be located from error information, using human experience, after the fault has occurred. This has several disadvantages:

in the traditional mode, operation and maintenance personnel periodically check the state of the server and the index state of each application instance, and handle a fault when an index may exceed its threshold; when a fault has occurred but its level is not high, it may not be handled in time and remains unresolved;

the fault prediction model based on the machine learning supervised classification algorithm also has the defects that a large amount of manual energy needs to be invested in the labeling of historical sample data, and the quality of the manually labeled sample data can also have great influence on the accuracy of the model.

In view of the above disadvantages, the method for predicting a software instance fault provided in the embodiment of the present invention can predict a fault of an index based on a static threshold and an index based on a dynamic threshold in a software service, and label data of an alarm index, which is beneficial to classifying alarm data in a subsequent operation process.

The method for predicting the fault of the software instance provided by the invention is described in the following by combining the figures 1-2.

Fig. 1 is a schematic flowchart of a method for predicting a software instance fault according to an embodiment of the present invention. Referring to fig. 1, the method for predicting the software instance fault includes:

step 101: and acquiring real-time data of the alarm indexes of the software instances.

Along with the change of computer software industry environment and the increasing complexity of the calling and deploying relationship of each service system, the calling relationship among all the components is also increased, and the timely repair and prediction of associated instance faults in the service systems become more important; the software instances are software programs or processes in the business system, and each software instance includes one or more indexes.

Specifically, a large amount of data is generated in the running process of the software instance, wherein the data may be real-time running data generated by the software instance or real-time data of an alarm indicator in the software instance. Specifically, the real-time data of the alarm index of the software instance includes alarm data generated when the software instance operates normally, and also includes alarm data generated when the software instance actually has an abnormality.

Step 102: predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index;

the fault prediction model is trained based on fault indexes of software instances and historical alarm indexes associated with the fault indexes.

In the running process of the software instance, the generated data is stored and used as historical data; the stored content comprises the content of the historical data and whether the historical data is abnormal data. In this embodiment, a fault prediction model is established by collecting historical data within a certain period of time and using an Apriori algorithm, and the fault prediction model may be trained according to an association relationship between a fault index and an alarm index, so that after real-time data of the alarm index of a software instance is input into the fault prediction model, the fault prediction model may determine whether the software instance fails after the certain period of time according to experience of the historical data.

According to the method for predicting the software instance fault, the real-time data of the alarm index of the software instance is obtained; predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index; whether the software instance fails or not is predicted in advance through the failure prediction model, and meanwhile possible failure points of the failure are predicted, so that effective basis can be provided for the failure repair, the time for manually troubleshooting failure reasons is saved, and the failure repair efficiency is improved.
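As a rough illustration of the prediction step, the sketch below assumes the fault prediction model is simply a list of association rules of the kind described in the later embodiments. The function name `predict_faults`, the rule field names other than `mediaElaps`, the threshold on confidence, and the sample rules are all hypothetical; only the "lift greater than 1" criterion comes from the description.

```python
def predict_faults(alarm_item, rules, min_confidence=0.5, min_lift=1.0):
    """Return the rules triggered by a real-time alarm, strongest first."""
    hits = [r for r in rules
            if r["alarm"] == alarm_item
            and r["confidence"] >= min_confidence  # assumed threshold
            and r["lift"] > min_lift]              # lift > 1: effective rule
    return sorted(hits, key=lambda r: r["confidence"], reverse=True)


# Hypothetical rule list standing in for the trained model.
rules = [
    {"alarm": "jvm.heap.used", "fault": "jvm.memory.overflow",
     "confidence": 0.75, "lift": 1.5, "mediaElaps": 240_000},
    {"alarm": "jvm.heap.used", "fault": "instance.down",
     "confidence": 0.55, "lift": 1.2, "mediaElaps": 900_000},
    {"alarm": "disk.usage", "fault": "instance.down",
     "confidence": 0.9, "lift": 0.8, "mediaElaps": 60_000},
]
predictions = predict_faults("jvm.heap.used", rules)
```

Each returned rule names a candidate fault and carries the time-interval statistics that could be used to estimate when the fault may occur.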

Further, on the basis of the above embodiment, according to the method for predicting a fault of a software instance provided by the present invention, predicting a fault of the software instance by a fault prediction model according to the real-time data of the alarm indicator includes:

Specifically, in this embodiment, whether a fault occurs is predicted according to real-time data generated by an alarm indicator of a software instance, that is, according to data of the alarm indicator, a cause that may cause the fault of the software instance is predicted; and acquiring data of alarm indexes of the software instances, wherein the data of the alarm indexes comprise alarm data generated in normal operation and alarm data generated when the software instances are actually abnormal.

The keyword set may be generated before prediction is performed on the real-time data of the alarm indexes of the software instance. Data of historical alarm indexes is acquired from the historical data, the abnormal historical alarm index data generated when an abnormality actually occurred is identified, and keywords that indicate that an alarm index of the software instance is actually abnormal are extracted from that data to generate the keyword set.

When newly generated real-time data of an alarm index matches a keyword in the keyword set, that real-time data is determined to be abnormal, that is, the alarm index is an abnormal alarm index. The real-time data of the abnormal alarm index is then input into the fault prediction model, which predicts the faults that may occur in the software instance together with their location and occurrence probability.

In the embodiment, by acquiring the real-time data of the alarm index in the software instance and based on the keyword set, when the software instance service fails, the possible fault point of the fault can be predicted, an effective basis can be provided for repairing the fault, the time for manually troubleshooting the fault reason is saved, and the fault repairing efficiency is improved.
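The keyword screening described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_keyword_set` and `is_abnormal`, the whitespace-separated layout of the `keywords` field, and the sample records are all hypothetical.

```python
def build_keyword_set(abnormal_history):
    """Collect keywords from historical alarms already known to be abnormal."""
    keywords = set()
    for alarm in abnormal_history:
        # Assumption: each record carries a whitespace-separated 'keywords' string.
        keywords.update(alarm["keywords"].lower().split())
    return keywords


def is_abnormal(alarm, keyword_set):
    """Treat an alarm as abnormal if any of its keywords is in the set."""
    return any(word in keyword_set for word in alarm["keywords"].lower().split())


history = [{"keywords": "OutOfMemory heap"}, {"keywords": "timeout"}]
keyword_set = build_keyword_set(history)

realtime = [
    {"item": "jvm.heap", "keywords": "heap usage"},
    {"item": "cpu.load", "keywords": "normal"},
]
# Only abnormal alarms are forwarded to the fault prediction model.
to_predict = [a for a in realtime if is_abnormal(a, keyword_set)]
```

Screening first keeps alarms raised during normal operation away from the model, so that only alarms that resemble past abnormal ones trigger a prediction.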

Further, on the basis of the above embodiment, according to the method for predicting a fault of a software instance provided by the present invention, before inputting the real-time data of the alarm indicator into the fault prediction model and predicting the fault of the software instance, the method includes:

and establishing a fault prediction model according to the association rule.

During the running of a software instance, different alarm indexes are generated before a fault index occurs; that is, the occurrence of particular alarm indexes may foreshadow the occurrence of particular fault indexes. The occurrence of a fault index can therefore be predicted by determining the association relationship between the fault index and the alarm indexes, and the fault prediction model is established from this association relationship.

Historical data of the software instance is acquired, any fault index is selected from it, and the occurrence time of that fault index is determined. The alarm indexes within the first preset time period before the occurrence time are taken as the preamble alarm indexes of the fault index, that is, at least one alarm index correlated with the fault index is determined. The historical alarm indexes within a second preset time period before the occurrence time are then acquired; the second preset time period is longer than the first preset time period, so the historical alarm indexes in the second preset time period contain more information about the preamble alarm indexes. Both preset time periods can be set manually.

The strength of the association between a preamble alarm index and the fault index can be determined from the preamble alarm indexes and the historical alarm indexes, so that an association rule between the fault index and any preamble alarm index is generated and the fault prediction model is established.

For example, the first preset time period is set to one hour and the second preset time period to one month. A unique fault index set is obtained from the historical alarm data set and the time slicing unit. For each fault G, the historical alarm data table A for the month before the fault occurrence time and the unique alarm index set B for the hour before the fault occurrence time are extracted. By analyzing the association relationship between set B and fault G, a result rule set C is obtained, containing the preamble alarm indexes that lead to fault G together with data such as the time interval between each preamble alarm index and the fault index and the occurrence probability. All fault indexes are traversed to obtain the result rule set C for all fault indexes.

Specifically, string data transmitted through a RESTful API is received, and the JSON string is deserialized to obtain a data dictionary dit. The historical alarm list in dit is converted into a Python dataframe; each item in the historical alarm list is a Python dictionary containing fields such as the alarm time eventtime, the alarm index name item, the alarm instance id, the alarm type keywords, and whether the alarm is a fault.
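The deserialization step might look like the sketch below. The field names `eventtime`, `item`, `id`, `keywords` and `maxInternalMs` come from the description; the top-level key `alarms`, the flag name `isFault`, the timestamp format, and the helper `parse_history` are assumptions. To stay dependency-free, the sketch keeps the records in a plain list rather than a dataframe.

```python
import json
from datetime import datetime

# Hypothetical payload as it might arrive over the RESTful API.
payload = '''{
  "maxInternalMs": 60000,
  "alarms": [
    {"eventtime": "2021-07-01 10:05:00", "item": "instance.down",
     "id": "inst-1", "keywords": "down", "isFault": true},
    {"eventtime": "2021-07-01 10:00:00", "item": "jvm.heap.used",
     "id": "inst-1", "keywords": "heap", "isFault": false}
  ]
}'''


def parse_history(raw):
    """Deserialize the JSON string and normalize each alarm record."""
    dit = json.loads(raw)
    alarms = []
    for rec in dit["alarms"]:
        alarms.append({
            "eventtime": datetime.strptime(rec["eventtime"], "%Y-%m-%d %H:%M:%S"),
            "item": rec["item"],
            "id": rec["id"],
            "keywords": rec["keywords"],
            "is_fault": bool(rec["isFault"]),
        })
    # Sort chronologically so later steps can slice by time.
    alarms.sort(key=lambda a: a["eventtime"])
    return dit["maxInternalMs"], alarms


max_internal_ms, alarms = parse_history(payload)
```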

The historical alarms of the system over the last N years (for example, 1 year) are collected, and the historical alarm data table dataframe is bucketed in chronological order according to the time slice unit maxInternalMs in dit. All fault indexes in the whole historical alarm data table dataframe are then collected and de-duplicated to obtain the unique fault index set GuZhangList.

For any fault index G in the fault index set GuZhangList, the alarm indexes within one hour before G occurs are acquired; if these include an alarm index M, the alarm index M and the fault index G are determined to have an association relationship. The association degree of the alarm index M and the fault index G is determined from information such as the times and frequency at which the alarm index M occurs in the historical alarm index data for the month before the fault index G occurs. In this way the association relationships between all fault indexes of the software instance and the alarm indexes are established, and the fault prediction model is built, so that fault prediction is realized.
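One possible reading of the time-slice bucketing and of the unique fault set is sketched below. The record layout (`ts_ms`, `item`, `is_fault`) and the helper names are assumptions; the sketch also assumes that only non-empty time slices count as buckets, which the description does not specify.

```python
from collections import defaultdict


def bucket_alarms(alarms, slice_ms):
    """Group alarms into consecutive time slices of slice_ms milliseconds."""
    buckets = defaultdict(list)
    for alarm in alarms:
        buckets[alarm["ts_ms"] // slice_ms].append(alarm)
    # Return the non-empty buckets in chronological order.
    return [buckets[key] for key in sorted(buckets)]


def unique_faults(alarms):
    """De-duplicated fault index names, in order of first appearance."""
    return list(dict.fromkeys(a["item"] for a in alarms if a["is_fault"]))


alarms = [
    {"ts_ms": 1_000, "item": "jvm.heap.used", "is_fault": False},
    {"ts_ms": 61_000, "item": "instance.down", "is_fault": True},
    {"ts_ms": 62_000, "item": "jvm.heap.used", "is_fault": False},
    {"ts_ms": 200_000, "item": "instance.down", "is_fault": True},
]
bucket_list = bucket_alarms(alarms, slice_ms=60_000)
gu_zhang_list = unique_faults(alarms)
```

Whether empty slices should be counted matters later, because the total number of buckets is the denominator of every probability estimate.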

In this embodiment, an association rule between the fault index and any preamble alarm index is generated through the preamble alarm index in the first preset time period and the historical alarm index in the second preset time period, and a fault prediction model is established; and the association rules of all fault indexes in the software instance are generated, so that the prediction precision and efficiency of the fault indexes are improved.
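The two look-back windows can be illustrated with the sketch below, which uses the one-hour and one-month (taken here as 30 days) values from the example. The helper `windows_before` and the record layout are hypothetical; including the fault time itself in the long window is an assumption made so that the slice containing the fault is available for counting.

```python
from datetime import datetime, timedelta


def windows_before(alarms, fault_time,
                   first=timedelta(hours=1), second=timedelta(days=30)):
    """Return (preamble alarms, historical alarms) relative to fault_time."""
    # First preset period: alarms strictly before the fault, within `first`.
    preamble = [a for a in alarms
                if fault_time - first <= a["eventtime"] < fault_time]
    # Second preset period: the longer history, up to and including the fault.
    history = [a for a in alarms
               if fault_time - second <= a["eventtime"] <= fault_time]
    return preamble, history


t0 = datetime(2021, 7, 28, 12, 0)
alarms = [
    {"eventtime": t0 - timedelta(days=40), "item": "disk.usage"},
    {"eventtime": t0 - timedelta(days=3), "item": "jvm.heap.used"},
    {"eventtime": t0 - timedelta(minutes=20), "item": "jvm.heap.used"},
    {"eventtime": t0, "item": "instance.down"},
]
preamble, history = windows_before(alarms, fault_time=t0)
```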

Further, on the basis of the foregoing embodiment, according to the method for predicting a software instance fault provided in the present invention, the generating an association rule between the fault indicator and any preamble alarm indicator by using the preamble alarm indicator in the first preset time period and the historical alarm indicator in the second preset time period includes:

Among the preamble alarm indexes in the first preset time period, the same alarm index may appear multiple times, so the preamble alarm indexes are de-duplicated to obtain the target alarm indexes, whose association with the fault index is then determined.

The historical alarm indexes are bucketed through a time slicing operation; the time slicing unit may be the one given in the data dictionary dit. The resulting historical alarm index buckets are traversed to obtain the total number of buckets, and the support, confidence and lift of the fault index and the target alarm index are calculated from the frequency with which the target alarm index appears in the historical alarm index buckets and the frequency with which the fault index appears.

Specifically, each fault index A in the fault index set GuZhangList is traversed. The time Atime at which fault A first occurs in the whole alarm data table is obtained. All alarm indexes in the historical alarm data table dataframe within the 1 hour before Atime are extracted and de-duplicated to obtain the unique alarm index set useFaultList; the underlying assumption is that the time range relevant to the current fault A is one hour. The alarm data df-30 for the 30 days before Atime is then extracted from the historical alarm data table dataframe and divided in chronological order into buckets butList according to the time slice unit maxInternalMs in the original dit. Each bucket is then traversed to count: the total number D of buckets in butList; the number cntM of buckets in butList that contain the unique alarm index M; the number cntMA of buckets that contain both the unique alarm index M and the fault A; and the number cntA of buckets in butList that contain A.
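The bucket counting can be sketched as follows, assuming each bucket has already been reduced to the set of index names observed in its time slice. The function name `count_buckets` and the sample buckets are hypothetical; D, cntM, cntA and cntMA follow the description.

```python
def count_buckets(bucket_list, alarm_m, fault_a):
    """Return (D, cntM, cntA, cntMA) for alarm index M and fault index A."""
    total = len(bucket_list)
    cnt_m = cnt_a = cnt_ma = 0
    for bucket in bucket_list:
        has_m = alarm_m in bucket
        has_a = fault_a in bucket
        cnt_m += int(has_m)             # buckets containing M
        cnt_a += int(has_a)             # buckets containing A
        cnt_ma += int(has_m and has_a)  # buckets containing both M and A
    return total, cnt_m, cnt_a, cnt_ma


# Each bucket is the set of index names seen in one time slice (sample data).
bucket_list = [{"M", "A"}, {"M"}, {"A", "X"}, {"M", "A"}, {"X"}]
D, cntM, cntA, cntMA = count_buckets(bucket_list, "M", "A")
```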

The core idea of the Apriori algorithm is as follows:

Support: the support of the association rule A → B is P(AB), the probability that event A and event B occur together, i.e. Support(A → B) = P(AB);

Confidence: Confidence(A → B) = P(B|A) = P(AB)/P(A), the probability that event B occurs given that event A has occurred;

Lift: Lift(A → B) = P(B|A)/P(B) = P(AB)/(P(A)·P(B)), which measures whether the occurrence of event A really contributes to the occurrence of event B; as long as the lift is greater than 1, the rule A → B can be considered a strong and effective rule.

According to the Apriori algorithm, the rule from alarm index M to fault index A is written M → A; substituting the bucket counts into the formulas above gives:

P(M) = cntM/D, P(A) = cntA/D, P(MA) = cntMA/D;

from these, the support, confidence and lift of the alarm index M relative to the fault index A can be calculated and used as the association rule between the fault index A and the alarm index M.
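Given the counts D, cntM, cntA and cntMA, the three measures follow directly from the definitions above. The sketch below is one way to compute them; the function name `association_metrics`, the guard against zero counts, and the sample counts are assumptions.

```python
def association_metrics(D, cnt_m, cnt_a, cnt_ma):
    """Support, confidence and lift of the rule M -> A from bucket counts."""
    if D == 0 or cnt_m == 0 or cnt_a == 0:
        return None  # the rule cannot be evaluated without observations
    p_m = cnt_m / D
    p_a = cnt_a / D
    p_ma = cnt_ma / D
    support = p_ma              # P(MA)
    confidence = p_ma / p_m     # P(A|M) = P(MA) / P(M)
    lift = confidence / p_a     # P(A|M) / P(A)
    return {"support": support, "confidence": confidence, "lift": lift}


# Sample counts: 10 buckets, M in 4, A in 5, both in 3.
rule = association_metrics(D=10, cnt_m=4, cnt_a=5, cnt_ma=3)
```

With these sample counts the support is 3/10, the confidence is 3/4 and the lift is 3/2, so M → A would count as an effective rule under the "lift greater than 1" criterion.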

In the embodiment, the association rule of the fault index and the alarm index is obtained through calculation by an Apriori algorithm, so that the fault alarm is predicted through real-time data of the alarm index, an effective basis is provided for fault repair, the time for manually troubleshooting the fault reason is saved, and the fault repair efficiency is improved.

Further, on the basis of the foregoing embodiment, according to the method for predicting a software instance fault provided by the present invention, the obtaining a target alarm indicator having a correlation with the fault indicator includes:

Specifically, after the first preset time period before the occurrence time of the fault index is determined, the preamble alarm indexes in that period are acquired. The same alarm index may appear multiple times, so the preamble alarm indexes need to be de-duplicated. The preamble alarm indexes remaining after de-duplication are the target alarm indexes associated with the fault index.

In this embodiment, the target alarm indexes are obtained by de-duplicating the preamble alarm indexes, so that the alarm indexes associated with the fault index are determined accurately; the fault prediction model is then established by determining the association relationship between the target alarm indexes and the fault index, thereby realizing fault prediction for the software instance.
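De-duplication of the preamble alarm indexes is a small step; a minimal sketch is shown below. Preserving first-seen order is a choice made here for readability rather than something the description requires, and the function name and sample index names are hypothetical.

```python
def target_alarm_set(preamble_alarms):
    """De-duplicate preamble alarm indexes, keeping first-seen order."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(preamble_alarms))


use_fault_list = target_alarm_set(
    ["jvm.heap.used", "gc.time", "jvm.heap.used", "thread.count", "gc.time"]
)
```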

Further, on the basis of the above embodiment, according to the method for predicting a software instance fault provided by the present invention, the generating of the association rule between the fault indicator and any preamble alarm indicator further includes:

After the target alarm indexes associated with the fault index are determined, the time interval between any target alarm index and the fault index within the second preset time period is determined. The same fault index occurs multiple times in the historical data; for each occurrence, the time interval between the occurrence time of the target alarm index and the occurrence time of the fault index is determined, and the maximum, minimum and median of these intervals are calculated. The maximum, minimum and median time intervals are added to the association rule between the fault index and the target alarm index. When a fault of the software instance is predicted, the predicted time at which the fault index will occur can then be shown in the prediction result.

Specifically, the average time interval meanElaps of the rule M → A is calculated as follows: the time interval between the occurrence time of the alarm index M and the occurrence time of the fault index A is computed in each bucket, and all buckets are traversed to form a time interval list, from which the average time interval meanElaps of M → A is obtained. Similarly, the following quantities can be obtained:

the total time interval tolElaps of the rule M → A;

the maximum time interval maxElaps of the rule M → A;

the minimum time interval minElaps of the rule M → A;

the median time interval mediaElaps of the rule M → A;

The support, confidence, lift, total time interval, maximum time interval, minimum time interval and median time interval of the association between the alarm index M and the fault index A are assembled into a rule list. The application backend can then use the rule list returned by the fault prediction algorithm service for system fault prediction, predicting the possible faults together with the corresponding time intervals, confidence, validity, and so on.
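The interval statistics and the assembled rule can be sketched as follows. The names meanElaps, tolElaps, maxElaps, minElaps and mediaElaps come from the description; the helper `interval_stats`, the other rule fields, and the sample gaps are hypothetical.

```python
from statistics import mean, median


def interval_stats(intervals_ms):
    """Summarize the time gaps between alarm M and fault A across buckets."""
    if not intervals_ms:
        return None
    return {
        "tolElaps": sum(intervals_ms),
        "maxElaps": max(intervals_ms),
        "minElaps": min(intervals_ms),
        "meanElaps": mean(intervals_ms),
        "mediaElaps": median(intervals_ms),
    }


# Hypothetical gaps (in ms) between alarm M and fault A in four buckets.
gaps = [120_000, 300_000, 180_000, 600_000]

# Assemble one entry of the rule list (association measures are sample values).
rule = {"alarm": "M", "fault": "A",
        "support": 0.3, "confidence": 0.75, "lift": 1.5}
rule.update(interval_stats(gaps))
```

The median is less sensitive to a single unusually long gap than the mean, which may be why the claims list the median alongside the maximum and minimum.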

In this embodiment, by calculating the interval between the occurrence times of the target alarm indicator and the fault indicator and adding the time interval to the association rule, the occurrence time of the fault indicator can be predicted, so that a technician can know the occurrence condition of the fault indicator more clearly, repair the fault in time, and improve the fault repair efficiency.
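As a minimal sketch of the interval statistics described above, the following Python fragment summarises a list of alarm-to-fault time intervals for one rule M → A. The key names follow the identifiers used in this description (tolElaps, meanElaps, maxElaps, minElaps, mediaElaps); the function name interval_stats, the millisecond unit and the sample timestamps are assumptions made only for illustration.

```python
from statistics import mean, median


def interval_stats(intervals_ms):
    """Summarise the alarm-to-fault time intervals collected for one rule M -> A."""
    if not intervals_ms:
        raise ValueError("no intervals were collected for this rule")
    return {
        "tolElaps": sum(intervals_ms),       # total time interval
        "meanElaps": mean(intervals_ms),     # average time interval
        "maxElaps": max(intervals_ms),       # maximum time interval
        "minElaps": min(intervals_ms),       # minimum time interval
        "mediaElaps": median(intervals_ms),  # median time interval
    }


# Hypothetical (alarm time, fault time) pairs, one per bucket, in milliseconds.
pairs = [(1_000, 61_000), (5_000, 125_000), (9_000, 39_000)]
intervals = [fault_ms - alarm_ms for alarm_ms, fault_ms in pairs]
stats = interval_stats(intervals)
print(stats)
```

The resulting dictionary can be merged into the rule entry for M → A, so that a prediction can report an expected time window as well as a fault type.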

Further, on the basis of the foregoing embodiment, according to the method for predicting a failure of a software instance provided by the present invention, after predicting the failure of the software instance, the method includes:

The fault of the software instance is predicted through the fault prediction model to obtain a prediction result that includes the type of the fault. Different faults have different types; the real-time data corresponding to each type of fault is marked, and the real-time data and the prediction result obtained from it are stored in a relational database according to the marks, thereby establishing the association between the real-time data and the prediction result. When the prediction result is shown to a technician through the display interface, the real-time data and the prediction result are read from the database, so that the source of the prediction result can be shown in full.

In this embodiment, by marking the real-time data and storing the real-time data and the prediction result in the relational database, a data source can be provided for displaying the prediction result, and the prediction result can be completely displayed to a technician, so that the technician can obtain more sufficient prediction content and can process the fault more quickly.
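The storage step can be sketched as follows, using an in-memory SQLite database as a stand-in for the relational database. The table name prediction, its columns, the helper functions and the sample values are all hypothetical; the description does not specify a schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prediction (
        id INTEGER PRIMARY KEY,
        fault_type TEXT NOT NULL,     -- mark derived from the predicted fault type
        realtime_data TEXT NOT NULL,  -- real-time alarm data, stored as JSON text
        confidence REAL NOT NULL
    )
    """
)


def save_prediction(conn, realtime_data, fault_type, confidence):
    """Store the real-time data together with the prediction made from it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO prediction (fault_type, realtime_data, confidence) "
            "VALUES (?, ?, ?)",
            (fault_type, json.dumps(realtime_data), confidence),
        )


def load_predictions(conn, fault_type):
    """Read back the data and confidence for one fault type, for display."""
    rows = conn.execute(
        "SELECT realtime_data, confidence FROM prediction "
        "WHERE fault_type = ? ORDER BY id",
        (fault_type,),
    ).fetchall()
    return [(json.loads(data), confidence) for data, confidence in rows]


save_prediction(conn, {"item": "cpu_usage", "value": 97.5}, "cpu_overload", 0.82)
rows = load_predictions(conn, "cpu_overload")
print(rows)
```

Keeping the real-time data and the prediction in the same row is one simple way to preserve the association between them; a separate table joined on a key would serve equally well.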

Further, when the failure prediction model is generated by Apriori algorithm, the method further includes:

based on a restapi service, transmitting the historical data to the Apriori algorithm in the form of a character string; and converting the format of the historical data so that it is suitable for the Apriori algorithm.

After a large amount of historical data is acquired, the historical data needs to be input into an Apriori algorithm background to be processed.

In this embodiment of the invention, a restapi service based on the sanic framework is developed to standardize the input and output flows of historical index data. The historical sample data of the instance indexes is transmitted to the algorithm service background as a character string, which reduces the transmission cost and improves performance. The background parses, preprocesses and converts the historical data and then passes it to the standard algorithm method for model training, yielding a model for the corresponding algorithm. Parsing, preprocessing and conversion turn the historical data received as a character string into a data format that the Apriori algorithm can recognize, so that the historical data can be processed.

Through restapi service, the historical data is transmitted to an Apriori algorithm in a character string mode, and the format of the historical data is converted, so that the data transmission cost can be reduced, the performance of the algorithm can be improved, and the generation of a fault prediction model can be accelerated.
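The string-based transmission and format conversion can be illustrated with the Python standard library alone. This sketch leaves out the sanic service itself and shows only the payload round trip; the payload layout, the ISO-8601 time format, the is_fault flag and the function names are assumptions, while eventtime, item and maxInternalMs are names taken from this description.

```python
import json
from datetime import datetime


def encode_history(alarms, max_internal_ms):
    """Serialise the historical alarms into one JSON character string."""
    return json.dumps({"maxInternalMs": max_internal_ms, "alarms": alarms})


def decode_history(payload):
    """Parse the character string and convert alarm times to datetime objects."""
    data = json.loads(payload)
    for alarm in data["alarms"]:
        alarm["eventtime"] = datetime.fromisoformat(alarm["eventtime"])
    return data


# Hypothetical historical alarms as they might be sent by the caller.
history = [
    {"eventtime": "2021-07-01T10:00:00", "item": "cpu_high", "is_fault": False},
    {"eventtime": "2021-07-01T10:05:00", "item": "service_down", "is_fault": True},
]
payload = encode_history(history, max_internal_ms=60_000)
decoded = decode_history(payload)
print(type(payload).__name__, decoded["alarms"][1]["eventtime"])
```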

Further, after the fault of the software instance is predicted, the prediction result of the fault of the software instance also needs to be displayed; wherein the prediction result includes a fault type and an occurrence probability (i.e., confidence) of the fault.

After the software instance is predicted to have a fault through the fault prediction model, a prediction result needs to be displayed to a technician. Specifically, the prediction result of the software instance may include the type of the fault to be generated, the predicted time of the fault generation, the fault point that may exist when the fault occurs, and the like. By displaying the prediction data, technicians can be reminded to deal with the faults of the software instances in advance, so that the normal operation of the software instances is guaranteed.

Fig. 2 is a schematic flow chart of a method for implementing fault prediction of software instance indexes based on Apriori algorithm according to another embodiment of the present invention. Referring to fig. 2, specifically, the method for realizing the fault prediction of the software instance index based on the Apriori algorithm includes:

step 201: acquiring monitoring data of the indexes for the most recent month through monitoring equipment, that is, through the user's existing monitoring equipment or other monitoring products;

step 202: according to the range that the user wants to check, querying the software instance call-chain relationship from the monitoring database in real time through a timed task, and querying the instance index data of each instance and the instances it calls, for use by the algorithm;

step 203: developing a restapi service based on the sanic framework, standardizing the historical index data input and result output processes, transmitting the historical alarm sample data of the instance indexes to the algorithm service background as a character string to reduce the transmission cost and improve performance, and, after the background parses, preprocesses and converts the historical data, passing it to the fault prediction method based on the idea of the Apriori algorithm;

step 204: the basic idea of the scheme is as follows. A unique fault alarm set is counted based on the historical alarm data set and the time slice unit. For each fault G, the historical alarm data table A covering the month before the fault occurrence time and the unique alarm index set B covering the preceding 1 hour are intercepted. By analyzing the association between the set B and the fault alarm G, a result rule set C is obtained, which contains the preamble alarm indexes that cause the fault G together with data such as the time interval and occurrence probability of the resulting fault index. The result rule set C of all fault indexes is obtained by traversing all fault indexes;

step 205: acquiring corresponding historical service alarm data according to the service system in the prediction range;

step 206: matching, according to the detection-range keywords, the historically abnormal alarm data within the alarm data of step 205, and, if the detection result is abnormal, sending the data to the AI (namely the fault prediction model) for detection;

step 207: marking the current data according to the failure result of AI prediction, and then storing the calculation index and the calculation result into a relational database to be used as a data source of a display interface.

Specifically, step 204 includes:

Step 2041: receiving the character string data transmitted through the RESTful API, deserializing the JSON character string to obtain a data dictionary dit, and converting the historical alarm list in the dit into the dataframe format of Python, wherein each item in the historical alarm list is a Python dictionary containing fields such as the alarm time eventtime, the alarm index name item, the alarm instance id, the alarm type keywords, and whether the item is a fault.

Step 2042: acquiring the historical alarms of the system for the last N years, dividing the historical alarm dataframe into buckets in chronological order according to the time slice unit maxInternalMs in the dit, counting all fault indexes in the dataframe of the whole historical alarm data table, and removing duplicates to obtain the unique fault index set GuZHANGList.
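The bucketing and de-duplication of step 2042 might look like the sketch below, which uses plain Python lists in place of a dataframe. The integer millisecond timestamps, the is_fault flag, the function names and the choice to skip empty time slices are assumptions; only maxInternalMs, eventtime and item come from this description.

```python
from collections import defaultdict


def split_into_buckets(alarms, max_internal_ms):
    """Group alarms into consecutive time slices of max_internal_ms milliseconds."""
    if not alarms:
        return []
    ordered = sorted(alarms, key=lambda a: a["eventtime"])
    start = ordered[0]["eventtime"]
    buckets = defaultdict(list)
    for alarm in ordered:
        slice_index = (alarm["eventtime"] - start) // max_internal_ms
        buckets[slice_index].append(alarm)
    # Assumption: time slices without any alarm are simply skipped.
    return [buckets[i] for i in sorted(buckets)]


def unique_faults(alarms):
    """Return the de-duplicated names of all indexes flagged as faults."""
    return sorted({a["item"] for a in alarms if a["is_fault"]})


# Hypothetical alarms; eventtime is in milliseconds.
alarms = [
    {"eventtime": 0, "item": "cpu_high", "is_fault": False},
    {"eventtime": 30_000, "item": "mem_high", "is_fault": False},
    {"eventtime": 70_000, "item": "service_down", "is_fault": True},
    {"eventtime": 200_000, "item": "cpu_high", "is_fault": False},
    {"eventtime": 210_000, "item": "service_down", "is_fault": True},
]
buckets = split_into_buckets(alarms, max_internal_ms=60_000)
fault_set = unique_faults(alarms)
print(len(buckets), fault_set)
```

Whether empty slices should count towards the total number of buckets affects the probabilities computed later, so that choice would need to match the intended definition of D.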

Step 2043: traversing each fault index A in the fault index set GuZHANGList. For each A, the time Atime at which the fault A first occurs in the whole alarm data table is obtained. All alarm indexes in the 1 hour of the historical alarm data table dataframe preceding Atime are intercepted and de-duplicated to obtain the unique alarm index set useFaultList, the rationale being that the time range relevant to the current occurrence of the fault A is taken to be one hour. The 30 days of alarm data df-30 preceding Atime are then intercepted from the historical alarm data table dataframe and divided into buckets butList in chronological order according to the time slice unit maxInternalMs in the original dit. Each bucket is traversed to count the total number D of buckets in butList, the number cntM of buckets containing the unique alarm index M, the number cntMA of buckets containing both the unique alarm index M and the fault alarm A, and the number cntA of buckets containing A.
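The four counts of step 2043 can be sketched in a few lines. Here each bucket is reduced to the set of index names that occur in it, which is an assumption about representation; the names D, cntM, cntA, cntMA and butList follow this description, while count_buckets and the sample data are hypothetical.

```python
def count_buckets(buckets, alarm_m, fault_a):
    """Count the buckets needed to evaluate the rule M -> A."""
    return {
        "D": len(buckets),                                   # all buckets
        "cntM": sum(1 for b in buckets if alarm_m in b),     # buckets containing M
        "cntA": sum(1 for b in buckets if fault_a in b),     # buckets containing A
        "cntMA": sum(1 for b in buckets
                     if alarm_m in b and fault_a in b),      # buckets containing both
    }


# Hypothetical bucket contents for a 30-day window.
but_list = [
    {"cpu_high", "service_down"},
    {"cpu_high"},
    {"mem_high"},
    {"cpu_high", "service_down"},
    {"service_down"},
]
counts = count_buckets(but_list, "cpu_high", "service_down")
print(counts)
```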

Step 2044: according to the core idea of Apriori algorithm:

alarm indicator M → fault indicator A, denoted here as rule M → A,

substituting the counts into the probability formulas gives P(M) = cntM/D, P(A) = cntA/D and P(MA) = cntMA/D;

calculating the average time interval meanElaps of the rule M → A, wherein the time interval between the occurrence time of the alarm index and the occurrence time of the fault index A is computed in each bucket, and all buckets are traversed to form a list of time intervals, from which the average time interval meanElaps of M → A is obtained; the following values are obtained in the same way:

calculating rule M → A total time interval tolElaps;

calculating rule M → A maximum time interval maxElaps;

calculating rule M → A minimum time interval minElaps;

calculating the rule M → A median time interval mediaElaps;

assembling the attribute values that measure the rule M → A in this step, and finally returning the data result to the background as a list of rules.
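To illustrate how the counts of step 2043 translate into rule metrics, the sketch below applies the standard association-rule definitions of support, confidence and lift. The description names these metrics but does not spell out their formulas, so the definitions used here are the conventional ones and the function name rule_metrics is hypothetical.

```python
def rule_metrics(d, cnt_m, cnt_a, cnt_ma):
    """Compute support, confidence and lift for the rule M -> A from bucket counts."""
    if d == 0 or cnt_m == 0 or cnt_a == 0:
        raise ValueError("D, cntM and cntA must all be positive")
    p_m = cnt_m / d    # P(M)
    p_a = cnt_a / d    # P(A)
    p_ma = cnt_ma / d  # P(MA)
    return {
        "support": p_ma,               # how often M and A share a bucket
        "confidence": p_ma / p_m,      # P(A | M)
        "lift": p_ma / (p_m * p_a),    # above 1 means M makes A more likely
    }


# Hypothetical counts: 5 buckets, M in 3, A in 3, both in 2.
metrics = rule_metrics(d=5, cnt_m=3, cnt_a=3, cnt_ma=2)
print(metrics)
```

A lift above 1 indicates that the fault A appears together with the alarm M more often than it would if the two were independent, which is the usual reason for keeping a rule.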

Step 2045: the background applies the rule list given by the fault prediction algorithm service to system fault prediction, predicting the possible faults and the corresponding time intervals, confidence, validity and the like.
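One way the background could apply the rule list is sketched below: each rule whose alarm index is currently active, and whose metrics pass the filtering thresholds, yields a predicted fault with an expected time window. The rule dictionary layout, the threshold values and the function name predict_faults are assumptions for illustration only.

```python
def predict_faults(active_alarms, rules, min_confidence=0.6, min_lift=1.0):
    """Return the faults predicted for the currently active alarm indexes."""
    predictions = []
    for rule in rules:
        if rule["alarm"] not in active_alarms:
            continue  # the preamble alarm of this rule has not fired
        if rule["confidence"] < min_confidence or rule["lift"] <= min_lift:
            continue  # the rule is not reliable enough
        predictions.append({
            "fault": rule["fault"],
            "confidence": rule["confidence"],
            "expected_within_ms": (rule["minElaps"], rule["maxElaps"]),
        })
    # Most confident prediction first.
    return sorted(predictions, key=lambda p: p["confidence"], reverse=True)


# Hypothetical rule list as it might be returned by the algorithm service.
rules = [
    {"alarm": "cpu_high", "fault": "service_down", "confidence": 0.8,
     "lift": 1.5, "minElaps": 30_000, "maxElaps": 120_000},
    {"alarm": "mem_high", "fault": "jvm_oom", "confidence": 0.4,
     "lift": 2.0, "minElaps": 10_000, "maxElaps": 50_000},
    {"alarm": "disk_high", "fault": "disk_full", "confidence": 0.9,
     "lift": 3.0, "minElaps": 60_000, "maxElaps": 600_000},
]
result = predict_faults({"cpu_high", "mem_high"}, rules)
print(result)
```

In this example only the cpu_high rule produces a prediction: the mem_high rule falls below the confidence threshold, and the disk_high alarm is not active.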

This scheme performs fault prediction using the idea of the Apriori algorithm, a statistical probability model. Historical alarm index data is collected first, and a fault identification keyword is added to locate the fault alarms. The temporal association between each alarm index and the fault alarm is then calculated based on the idea of the Apriori algorithm, that is, through association analysis based on time slicing. This yields the association metrics between the alarm index and the fault alarm, such as confidence, lift (promotion degree) and support, as well as the time range over which the alarm index leads to the fault, such as the average minimum time interval, average maximum time interval and average time interval. The scheme saves a large amount of manual labeling cost, can accurately predict the types of fault that may occur in the future, and gives a specific future time range, namely the average minimum time interval, average maximum time interval and average time interval of the faults caused by the alarm indexes.

Therefore, the embodiments of the present invention are used to solve the following problems:

The association between historical alarms and faults is analyzed based on the idea of the Apriori algorithm, and a fault prediction model service is constructed. The metrics of association between alarm indexes and fault alarms are counted, and reliable causal rules between indexes are selected by filtering on a threshold for each metric, so that fault prediction is achieved.

The method supports fault prediction for cumulative indexes, such as faults of CPU utilization, JVM memory overflow, disk utilization and other cumulative indexes. The fault types that may occur within a certain future time period are predicted through the fault prediction model service based on the idea of the Apriori algorithm, and early warning information is issued.

The user can create detection scopes according to different dimensions of different business views. And acquiring the service alarm data of all the examples in the detection range. And predicting other possible faults through service alarm sent by a service system.

The fault prediction method requires no manual labeling and runs fully automatically without manual intervention. The fault prediction model is trained periodically and updated iteratively; that is, the causal rules between indexes, their association metrics, such as confidence, lift (promotion degree) and support, and the metrics related to the influence duration, such as the average minimum time interval, average maximum time interval and average time interval of the faults caused by the alarm indexes, are updated, and faults within the detection range are predicted in real time.

The device for predicting the software instance fault provided by the invention is described below, and the device for predicting the software instance fault described below and the method for predicting the software instance fault described above can be referred to correspondingly.

Fig. 3 is a schematic structural diagram of a device for predicting a software instance fault provided by the present invention, and referring to fig. 3, the device for predicting a software instance fault includes:

an obtaining unit 301, configured to obtain real-time data of an alarm indicator of a software instance;

the prediction unit 302 is configured to predict a fault of the software instance through a fault prediction model according to the real-time data of the alarm indicator;

The prediction apparatus for software instance faults provided in this embodiment is suitable for the prediction method for software instance faults provided in the foregoing embodiments, and is not described herein again.

Specifically, according to the prediction apparatus for software instance fault provided by the present invention, the predicting the fault of the software instance according to the real-time data of the alarm indicator and through the fault prediction model includes:

According to the prediction device for the software instance fault, the step of inputting the real-time data of the alarm index into the fault prediction model and before predicting the fault of the software instance comprises the following steps:

and establishing a fault prediction model according to the association rule.

According to the apparatus for predicting a software instance fault provided by the present invention, the generating of the association rule between the fault indicator and any preamble alarm indicator through the preamble alarm indicator in the first preset time period and the historical alarm indicator in the second preset time period includes:

According to the prediction apparatus for software instance fault provided by the present invention, the obtaining of the target alarm indicator having correlation with the fault indicator includes:

According to the apparatus for predicting a software instance fault provided by the present invention, the generating of the association rule between the fault indicator and any preamble alarm indicator further includes:

According to the prediction apparatus for software instance fault provided by the present invention, after predicting the fault of the software instance, the apparatus comprises:

Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform a method of predicting a failure of a software instance, the method comprising: acquiring real-time data of an alarm index of a software instance; predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index; the fault prediction model is trained based on fault indexes of software instances and alarm indexes associated with the fault indexes.

In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method for predicting a failure of a software instance provided by the above methods, the method comprising: acquiring real-time data of an alarm index of a software instance; predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index; the fault prediction model is trained based on fault indexes of software instances and alarm indexes associated with the fault indexes.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for predicting a failure of a software instance provided in the above aspects, the method comprising: acquiring real-time data of an alarm index of a software instance; predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index; the fault prediction model is trained based on fault indexes of software instances and alarm indexes associated with the fault indexes.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting a failure of a software instance, comprising:

acquiring real-time data of an alarm index of a software instance;

2. The method for predicting the fault of the software instance according to claim 1, wherein the predicting the fault of the software instance through a fault prediction model according to the real-time data of the alarm index comprises:

3. The method for predicting the fault of the software instance according to claim 1, wherein the step of inputting the real-time data of the alarm indicator into the fault prediction model to predict the fault of the software instance comprises the following steps:

and establishing a fault prediction model according to the association rule.

4. The method for predicting software instance fault according to claim 3, wherein the generating the association rule of the fault indicator and any preamble alarm indicator by using the preamble alarm indicator in the first preset time period and the historical alarm indicator in the second preset time period comprises:

5. The method for predicting software instance fault according to claim 4, wherein the obtaining of the target alarm indicator having correlation with the fault indicator comprises:

6. The method for predicting software instance fault according to claim 3 or 4, wherein the generating the association rule of the fault indicator and any preceding alarm indicator further comprises:

7. The method for predicting the failure of the software instance according to claim 1, wherein the predicting the failure of the software instance comprises:

8. An apparatus for predicting a failure of a software instance, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method for predicting a failure of a software instance according to any one of claims 1 to 7 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the method for predicting a failure of a software instance according to any one of claims 1 to 7.