CN115408182A - Service system fault positioning method and device

Info

Publication number: CN115408182A
Application number: CN202110587616.3A
Authority: CN
Inventors: 张玲; 朱丹; 徐海勇; 舒敏根; 廖丽玲; 梁恩磊; 郭艺娟
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-11-29

Abstract

The invention provides a method and a device for positioning a fault of a service system, wherein the method comprises the following steps: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance. On one hand, the invention realizes that only the target service instance with faults is subjected to fault detection, thereby greatly reducing the calculated amount and effectively improving the positioning efficiency; on the other hand, the fault reason of the service system is positioned layer by layer according to the hierarchical relation, so that the positioning result is more accurate.

Description

Service system fault positioning method and device

Technical Field

The present invention relates to the field of fault diagnosis technologies, and in particular, to a method and an apparatus for locating a fault in a service system.

Background

With the rapid development of computer technology, data information is more and more abundant. In order to obtain valuable information from abundant data information, a business system is adopted to process a large amount of data information. However, as the amount of data increases, the data structure becomes more and more complex, and the business logic between data becomes more and more complex, so that the business system is prone to failure. In order to timely process the fault problem of the service system, the fault source of the service system needs to be located so as to timely process the fault source.

In the prior art, the indexes of all instances in a service system are usually monitored, an abnormality evaluation score is calculated for each instance, and the fault source of the service system is determined from the instances whose abnormality evaluation scores meet a preset abnormality condition.

However, each business system typically contains many independent services (i.e., code that implements certain functionality), and each service deploys many instances, so each business system contains a large number of instances. Locating a service system fault in this way requires the abnormality evaluation scores of all instances in the service system to be calculated in real time, so the amount of calculation is large and the positioning efficiency is low.

Disclosure of Invention

The invention provides a method and a device for positioning a fault of a service system, which are used for solving the defects of large calculated amount and low positioning efficiency caused by the fact that abnormal evaluation scores of all instances in the service system are calculated in real time in the prior art and realizing the improvement of the positioning efficiency.

The invention provides a method for positioning a fault of a service system, which comprises the following steps:

judging whether the service system has a fault according to the operation index of the service system in the current time period;

if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period;

selecting a target service from all services of the service system according to the judgment result of each service;

and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

According to the method for positioning the fault of the service system provided by the invention, the step of judging whether the service system has the fault or not according to the operation index of the service system in the current time period comprises the following steps:

inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system;

comparing the anomaly score with a first preset threshold;

judging whether the service system has a fault according to the comparison result;

wherein the abnormality detection model is obtained by training with the operation indexes of the business system in historical periods as samples and the abnormal scores corresponding to the samples as sample labels.

According to the method for positioning the fault of the service system provided by the invention, the step of inputting the operation index of the service system in the current time interval into an abnormality detection model and outputting the abnormality score of the service system comprises the following steps:

inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system;

subtracting the operation indexes of the service system after reconstruction and before reconstruction, and then calculating an absolute value;

and taking the maximum value in the absolute value result as the abnormal score of the business system.

According to the method for locating the fault of the service system provided by the invention, the step of judging whether the fault exists in each service according to the operation index of each service of the service system in the current time period comprises the following steps:

fusing the operation indexes of each service at a plurality of moments in the current time period, fusing the operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the service system at a plurality of moments in the current time period;

subtracting the fusion results of the operation indexes of each service in the current time interval and the previous time interval of the current time interval, and calculating an absolute value;

calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period;

taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service;

and comparing the abnormal score of each service with a second preset threshold value, and judging whether each service has a fault according to the comparison result.

According to the method for positioning the fault of the service system provided by the invention, the fusion of the operation indexes of each service at a plurality of moments in the current time period comprises the following steps:

and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each service operation index at a plurality of moments in the current time period.

According to the method for positioning the fault of the business system provided by the invention, the step of selecting the target service from all the services of the business system according to the judgment result of each service comprises the following steps:

if the service with the fault is one, taking the service with the fault as the target service;

if the service with the fault is multiple, selecting the service which is executed firstly from the service with the fault as a target service according to the execution sequence of the service with the fault in the service system;

and if all services have no fault, taking the service with the maximum abnormal score in the business system as a target service.
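The three selection rules above can be sketched as follows. This is only an illustrative sketch: the function name `select_target_service`, the dictionary input format and the `execution_order` list are assumptions introduced here and are not defined by the invention.

```python
from typing import Dict, List, Optional


def select_target_service(
    scores: Dict[str, float],
    execution_order: List[str],
    threshold: float,
) -> Optional[str]:
    """Pick the target service from per-service abnormal scores.

    scores: abnormal score per service name (hypothetical input format).
    execution_order: service names in the order the business system runs them.
    threshold: the second preset threshold.
    """
    if not scores:
        return None
    faulty = [name for name, score in scores.items() if score > threshold]
    if len(faulty) == 1:
        # Exactly one faulty service: it is the target service.
        return faulty[0]
    if len(faulty) > 1:
        # Several faulty services: take the one executed first.
        rank = {name: i for i, name in enumerate(execution_order)}
        return min(faulty, key=lambda name: rank.get(name, len(rank)))
    # No service exceeds the threshold: fall back to the highest score.
    return max(scores, key=scores.get)
```

For example, with scores `{"a": 0.9, "b": 0.95}`, execution order `["b", "a"]` and a threshold of 0.8, both services are judged faulty and `"b"` is returned because it is executed first.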

According to the method for locating the fault of the business system provided by the invention, the step of judging whether the fault exists in each instance according to the operation index of each instance of the target service at each moment in the current time interval comprises the following steps:

fusing operation indexes of each instance at a plurality of moments in the current time period, fusing operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the target service at a plurality of moments in the current time period;

subtracting the fusion results of the operation indexes of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value;

calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period;

taking the sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance;

and comparing the abnormal score of each instance with a third preset threshold, and judging whether each instance has a fault according to the comparison result.

The invention also provides a device for positioning the fault of the service system, which comprises:

the first judgment module is used for judging whether the service system has a fault or not according to the operation index of the service system in the current time period;

the second judgment module is used for judging whether each service has a fault according to the operation index of each service of the service system in the current time period if the service system has the fault;

the selection module is used for selecting a target service from all the services of the business system according to the judgment result of each service;

and the fault positioning module is used for judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the service system fault location methods.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the business system fault locating method as described in any of the above.

According to the method and the device for positioning the fault of the business system, the fault detection is carried out on the business system in real time, the fault detection is carried out on each service of the business system under the condition that the fault exists in the business system, so that the target service is determined, then the fault detection is carried out on each instance of the target service, and further the fault reason of the business system is positioned, so that on one hand, the fault detection is carried out only on the instance of the target service with the fault, the calculated amount is greatly reduced, and the positioning efficiency is effectively improved; on the other hand, the fault reasons of the service system are positioned layer by layer according to the hierarchical relationship, so that the positioning result is more accurate.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for locating a fault in a service system according to the present invention;

FIG. 2 is a schematic structural diagram of a service topology in the service system fault location method provided in the present invention;

FIG. 3 is a second flowchart of the method for locating a fault in a service system according to the present invention;

FIG. 4 is a schematic structural diagram of a service system fault location device provided by the present invention;

FIG. 5 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The method for locating a fault in a service system according to the present invention is described below with reference to fig. 1, which includes: step 101, judging whether a service system has a fault according to an operation index of the service system in a current time period;

the service system may be a data traffic package order service system or a take-out service system or other service systems having a multi-level calling relationship.

The number of the service systems may be one or more, and this embodiment does not specifically limit this.

The current time period includes a plurality of time instants, and the present embodiment does not specifically limit the number of time instants in the current time period.

The operation index of the service system is a plurality of performance indexes representing the overall operation state of the service system, including the service volume, the success rate and the delay of the service system, and the content of the operation index of the service system is not specifically limited in this embodiment.

Optionally, the traffic volume, the success rate, and the delay of the service system may be obtained by timed monitoring through a timing monitoring device, or by monitoring the overall operation condition of the service system according to a preset period, which is not specifically limited in this embodiment. The preset period can be set according to actual requirements, such as one minute.

Whether the service system has a fault is judged by combining the operation indexes of the service system at a plurality of moments in the current time period. If the service system has no fault, the operation indexes of the service system at a plurality of moments in the next time period are monitored and the judgment is repeated.

If the service system has a fault, layer-by-layer fault location is performed on the components of the service system, and the root-cause instance that caused the service system fault is located automatically.

Step 102, if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period;

wherein, the service of the business system is a code for realizing a certain function.

The operation index of each service is the operation index generated in the process that the service responds to other services or the operation index generated in the process that each service requests other services. Wherein the operation index of the service comprises the traffic volume, the success rate and the delay of the service.

As shown in fig. 2, business system A includes a plurality of independent services, namely service 1, service 2, through service n, and each service has a plurality of instances deployed under it. There is a business request-and-response relationship between service 1 and service 2. The operation index generated while service 2 requests service 3 may be used as the operation index of service 2, or the operation index generated while service 2 responds to service 1 may be used as the operation index of service 2.

Here, sec1 is the process in which service 1 requests service 2, and SecX is the process in which service 2 returns its result to service 1 as a response.

And under the condition that the service system is determined to have faults, judging each service of the service system, and positioning the service causing the service system faults in the services of the service system according to the judgment result.

Optionally, the operation indexes of each service of the service system at multiple times in the current time period are obtained, and then whether each service has a fault is determined by combining the operation indexes at multiple times in the current time period.

Step 103, selecting a target service from all services of the business system according to the judgment result of each service;

the judgment result of each service includes two conditions of existence of fault or nonexistence of fault.

When selecting a target service, determining the number of services with faults in a service system;

when a service with a fault exists in a service system, selecting the service with the fault from all the services of the service system as a target service; wherein the target service is a service causing a failure of the business system.

When there are a plurality of failed services in the service system, all of them may be taken as target services, or one or more services may be selected from them as target services. Fault location is then performed on the instances under the target service to find the instance causing the fault of the target service, so that the source of the business system fault is located accurately.

And step 104, judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

Specifically, once the target service is obtained, each instance of the target service is judged, and the instance causing the fault of the target service is located among the instances of the target service according to the judgment result.

Optionally, the running indexes of the instances at multiple moments in the current time period are combined to determine whether the instances have faults.

When the instance in the target service has a fault, taking the instance with the fault as a fault positioning result of the business system;

when no fault exists in the instances of the target service, all the instances of the target service are used as the fault positioning result of the business system, namely, the root of the fault of the business system is the service level fault generated by the target service.

According to the implementation, firstly, fault detection is carried out on a business system in real time, fault detection is carried out on each service of the business system under the condition that the business system has faults so as to determine target services, then fault detection is carried out on each instance of the target services, and further the fault reason of the business system is positioned, on one hand, fault detection is carried out only on the instances of the target services with the faults, so that the calculated amount is greatly reduced, and the positioning efficiency is effectively improved; on the other hand, the fault reason of the service system is positioned layer by layer according to the hierarchical relation, so that the positioning result is more accurate.
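Steps 101 to 104 can be sketched as one layered routine. The sketch below is illustrative only and rests on several assumptions that are not part of the original text: the abnormal scores are supplied by the caller, `execution_order` lists every service, `service_scores` is non-empty, and when several services are faulty the one executed first is chosen. The names `locate_fault` and `instance_scores_of` are hypothetical.

```python
from typing import Callable, Dict, List


def locate_fault(
    system_score: float,
    service_scores: Dict[str, float],
    instance_scores_of: Callable[[str], Dict[str, float]],
    execution_order: List[str],
    th_system: float,
    th_service: float,
    th_instance: float,
) -> Dict[str, object]:
    # Step 101: stop early when the business system shows no fault.
    if system_score <= th_system:
        return {"faulty": False, "target_service": None, "instances": []}

    # Steps 102-103: judge each service and pick the target service.
    faulty = [s for s in execution_order if service_scores.get(s, 0.0) > th_service]
    if faulty:
        target = faulty[0]  # earliest faulty service in execution order
    else:
        target = max(service_scores, key=service_scores.get)

    # Step 104: only the instances of the target service are scored.
    inst_scores = instance_scores_of(target)
    bad = sorted(i for i, s in inst_scores.items() if s > th_instance)
    # No faulty instance: report all instances (a service-level fault).
    instances = bad if bad else sorted(inst_scores)
    return {"faulty": True, "target_service": target, "instances": instances}
```

Because `instance_scores_of` is called only for the target service, the instances of all other services are never scored, which is where the reduction in calculation comes from.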

On the basis of the foregoing embodiment, in this embodiment, the determining whether the service system has a fault according to the operation index of the service system in the current time period includes: inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system; comparing the anomaly score with a first preset threshold; judging whether the service system has a fault according to the comparison result; wherein the abnormality detection model is obtained by training with the operation indexes of the business system in historical periods as samples and the abnormal scores corresponding to the samples as sample labels.

The anomaly detection model may be an autoencoder (Auto-Encoder) model, a VAE (Variational Auto-Encoder) model, or a similar anomaly detection model, which is not specifically limited in this embodiment.

The structure of the anomaly detection model can be set according to actual requirements, such as the structures of an encoder and a decoder in the VAE model, and a loss function and the like can be set according to actual requirements. An example of a VAE model is shown in Table 1.

TABLE 1 VAE model Structure

As shown in fig. 3, before obtaining the abnormal score of the business system, the abnormal detection model is trained. Firstly, the operation index of the business system in the historical period is used as a sample, and the abnormal score of the business system in the historical period is used as a sample label. And carrying out normalization processing on the samples, inputting the normalized samples and the sample labels into an anomaly detection model, and training the anomaly detection model.

Wherein each historical period may be characterized by a sliding window. The sample $X_t$ for any historical period can be expressed as:

$$X_t = \mathrm{minmax}(\mathrm{Traffic}_t \cup \mathrm{SuccessRate}_t \cup \mathrm{Latency}_t);$$

$$\mathrm{Traffic}_t = \{\mathrm{traffic}_{t-L+1}, \ldots, \mathrm{traffic}_{t-1}, \mathrm{traffic}_t\};$$

$$\mathrm{SuccessRate}_t = \{\mathrm{success\_rate}_{t-L+1}, \ldots, \mathrm{success\_rate}_{t-1}, \mathrm{success\_rate}_t\};$$

$$\mathrm{Latency}_t = \{\mathrm{latency}_{t-L+1}, \ldots, \mathrm{latency}_{t-1}, \mathrm{latency}_t\};$$

where $\mathrm{minmax}(\cdot)$ is min-max normalization; $\mathrm{Traffic}_t$, $\mathrm{SuccessRate}_t$ and $\mathrm{Latency}_t$ are respectively the traffic volume, success rate and delay of the service system at time $t$; and $L$ is the length of the sliding window, i.e. the number of all time instants within the history period. The length of the sliding window can be set according to actual requirements, such as $L = 30$.

The sample set TrainData can be represented as:

$$\mathrm{TrainData} = \{X_{t-\mathrm{trainnum}+1}, \ldots, X_{t-1}, X_t\};$$

where trainnum is the number of training sample points and can be set according to actual requirements, for example trainnum ≥ 4320.
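The construction of one sample and of the sample set can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the three KPI series are equally long arrays indexed by time, and each KPI window is scaled separately, which is one possible reading of the minmax step and is not confirmed by the text.

```python
import numpy as np


def minmax(x: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1]; a constant array maps to zeros."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.zeros_like(x, dtype=float)
    return (x - lo) / (hi - lo)


def build_sample(traffic, success_rate, latency, t: int, L: int) -> np.ndarray:
    """Return X_t: the three KPI windows ending at index t, concatenated."""
    window = slice(t - L + 1, t + 1)
    parts = [np.asarray(k, dtype=float)[window] for k in (traffic, success_rate, latency)]
    # Assumption: each KPI window is scaled on its own so that KPIs with
    # different units contribute on a comparable scale.
    return np.concatenate([minmax(p) for p in parts])


def build_train_data(traffic, success_rate, latency, t: int, L: int, trainnum: int) -> np.ndarray:
    """Stack X_{t-trainnum+1}, ..., X_t into an array of shape (trainnum, 3 * L)."""
    return np.stack([
        build_sample(traffic, success_rate, latency, k, L)
        for k in range(t - trainnum + 1, t + 1)
    ])
```

The caller must make sure that `t - trainnum - L + 2 >= 0`, so that the earliest window does not start before the first recorded moment.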

After the anomaly detection model is trained, the operation index of the service system in the current time period is normalized, the normalized operation index is input into the trained anomaly detection model, and the anomaly score $S_t(A)$ of the service system at the current time $t$ in the current time period is output.

Wherein, the value range of the abnormal score at the current moment is [0,1]. The larger the abnormal score is, the more likely the service system is to be out of order at the current moment.

Then, the abnormal score $S_t(A)$ of the business system at the current moment is compared with a first preset threshold $th_A$. If $S_t(A) > th_A$, the service system has a fault at the current moment $t$, and layer-by-layer fault location is performed on the components of the service system at the current moment $t$, from the business system to the services to the instances.

The first preset threshold may be set according to actual requirements, such as a value between 0.5 and 0.8.

In the embodiment, the anomaly detection model is adopted, so that whether the service system has faults or not can be accurately monitored in real time.

On the basis of the foregoing embodiment, in this embodiment, the inputting the operation index of the business system in the current time period into an anomaly detection model, and outputting the anomaly score of the business system includes: inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system; subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value; and taking the maximum value in the absolute value result as the abnormal score of the business system.

Specifically, the step of obtaining the abnormal score of the business system by using the abnormal detection model includes, firstly, reconstructing an operation index of the business system in a current time period by using the abnormal detection model, and outputting the reconstructed operation index.

Then, the difference between the operation indexes before and after reconstruction is calculated, its absolute value is taken, and the maximum value in the absolute value result is used as the abnormal score $S_t(A)$ of the service system at the current moment.

The abnormal score $S_t(A)$ is calculated as follows:

$$S_t(A) = \max(|\mathrm{TestOutput} - \mathrm{TestData}|);$$

where $\max(\cdot)$ is the maximum operation, and TestData and TestOutput are the operation indexes of the service system before and after reconstruction, respectively.

The worse the reconstruction effect, that is, the larger the reconstruction error, the higher the probability that the service system has a fault.

According to the embodiment, the abnormal score of the service system is calculated directly according to the reconstruction result, and whether the service system has a fault can be determined quickly and accurately.
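The score computation and the threshold comparison can be sketched as follows. The sketch assumes that the trained anomaly detection model is available as a callable named `reconstruct` that maps a normalized input array to its reconstruction; this callable and both function names are hypothetical.

```python
from typing import Callable

import numpy as np


def anomaly_score(test_data: np.ndarray, reconstruct: Callable[[np.ndarray], np.ndarray]) -> float:
    """S_t(A) = max(|TestOutput - TestData|)."""
    test_output = reconstruct(test_data)
    return float(np.max(np.abs(test_output - test_data)))


def system_has_fault(score: float, th_a: float) -> bool:
    """Compare the abnormal score with the first preset threshold th_A."""
    return score > th_a
```

Since the input is min-max normalized, the score stays within [0, 1] as long as the reconstruction also stays within [0, 1].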

On the basis of the foregoing embodiments, in this embodiment, the determining whether each service has a fault according to the operation index of each service of the service system in the current time period includes: fusing operation indexes of each service at a plurality of moments in the current time period, fusing operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the service system at a plurality of moments in the current time period;

specifically, when it is determined that the service system has a fault at the current time t, fault location is performed on each service of the service system.

The operation indexes of the service system A and the service S include KPIs (Key Performance Indicators) such as traffic volume, success rate and delay, and each operation index is denoted by $kpi_i$, $i \in [1, n]$.

The manner of fusing any operation index of each service at multiple times in the current time period may be to calculate a mean value, a variance, and the like of the operation indexes at all times in the current time period, which is not limited in this embodiment.

Similarly, the method for fusing the operation indexes of each service at a plurality of moments in the previous period of the current period and the method for fusing the operation indexes of the service system at a plurality of moments in the current period are the same as the above fusion method.

Then, the fusion result of each operation index of the business system or the service is normalized, and a set generated by the fusion result after the normalization of all the operation indexes of the business system or the service is used as the local abnormal feature of the business system or the service.

The local abnormal features of the service system A are expressed as follows:

$$\mathrm{Feature}(A, W_t) = \{\mathrm{Feature}(W_{t,kpi_1}) \cup \mathrm{Feature}(W_{t,kpi_2}) \cup \ldots \cup \mathrm{Feature}(W_{t,kpi_n})\};$$

where $\mathrm{Feature}(W_{t,kpi_1})$ is the fusion result of the first KPI operation index, and $\mathrm{Feature}(A, W_t)$ is the local abnormal feature of service system A in the current time period $W_t$.

And in the same way, taking the set generated by the normalized fusion result of all the operation indexes of each service as the local abnormal feature of each service.

Subtracting the fusion result of each operation index of each service in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service;

The step of calculating the abnormal score of the service S includes, first, calculating the absolute value of the difference between the local abnormal features of the service S in the current time period $W_t$ and in the previous time period $W_{t-1}$, and then taking the maximum value in the absolute value result. The specific calculation formula is as follows:

$$\mathrm{AnomalyDetection}(S, W_t) = \max(|\mathrm{Feature}(S, W_t) - \mathrm{Feature}(S, W_{t-1})|);$$

where $\mathrm{AnomalyDetection}(S, W_t)$ is the local abnormal fluctuation score of the service S between the current time period and the previous time period, and $\mathrm{Feature}(S, W_t)$ is the local abnormal feature of the service S in the current time period $W_t$.

And calculating the similarity between the local abnormal feature of the service S in the current time interval and the local abnormal feature of the service system A in the current time interval, wherein the specific calculation formula is as follows:

$$P(S, W_t) = \mathrm{Similarity}\{\mathrm{Feature}(S, W_t), \mathrm{Feature}(A, W_t)\};$$

The similarity may be calculated by using a Pearson correlation coefficient or a cosine similarity, which is not specifically limited in this embodiment.

Finally, the weighted sum of the maximum value in the absolute value result corresponding to the service S and the similarity result corresponding to the service S is taken as the abnormal score of the service S. The specific calculation formula is as follows:

$$\mathrm{Score}(S, W_t) = \alpha \times \mathrm{AnomalyDetection}(S, W_t) + (1 - \alpha) \times P(S, W_t);$$

where $\alpha$ is a weight coefficient, which can be set according to actual requirements, such as 0.5.
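The service-level score can be sketched with NumPy as follows. The sketch assumes that the local abnormal features are already available as equally long numeric vectors and uses cosine similarity, one of the two similarity measures mentioned above; all function names are hypothetical.

```python
import numpy as np


def local_fluctuation(feature_now: np.ndarray, feature_prev: np.ndarray) -> float:
    """AnomalyDetection(S, W_t) = max(|Feature(S, W_t) - Feature(S, W_{t-1})|)."""
    return float(np.max(np.abs(feature_now - feature_prev)))


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def service_score(feat_s_now, feat_s_prev, feat_a_now, alpha: float = 0.5) -> float:
    """Score(S, W_t) = alpha * AnomalyDetection(S, W_t) + (1 - alpha) * P(S, W_t)."""
    s_now = np.asarray(feat_s_now, dtype=float)
    s_prev = np.asarray(feat_s_prev, dtype=float)
    a_now = np.asarray(feat_a_now, dtype=float)
    return alpha * local_fluctuation(s_now, s_prev) + (1 - alpha) * cosine_similarity(s_now, a_now)
```

The same routine applies at the instance level if the feature of the target service takes the place of the feature of the business system.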

And comparing the abnormal score of each service with a second preset threshold, and judging whether each service has a fault according to the comparison result.

After the abnormal score of each service is obtained, comparing the abnormal score of each service with a second preset threshold value;

If the abnormal score of a service is greater than the second preset threshold, the service has a fault; if the abnormal score is less than or equal to the second preset threshold, the service has no fault.

The second preset threshold may be set according to actual requirements, such as 0.8.

When the service-level fault source of the service system is located, this embodiment considers both the local abnormal fluctuation score of the operation indexes of each service between the current time period and the previous time period, and the correlation between the operation indexes of each service and those of the service system in the current time period. The fault location accuracy is therefore high in drill-down scenarios with a hierarchical relationship.

In addition, the abnormal score of each service is calculated directly from the operation indexes, which effectively avoids the resource consumption incurred when a model must be trained on a large amount of historical data in order to calculate the abnormal score. The calculation process is simple and the amount of calculation is small, so the fault location efficiency is high.

On the basis of the foregoing embodiment, in this embodiment, the fusing the operation indexes of each service at multiple times in the current time period includes: and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each service operation index at a plurality of moments in the current time period.

The statistical characteristic values include a mean value, a variance, a maximum value, and a minimum value, which is not specifically limited in this embodiment.

The wavelet decomposition value may be a db2 wavelet decomposition value or the like, and this embodiment is not particularly limited thereto.

And after the statistical characteristic value, the exponential weighted moving average value and the wavelet decomposition value of each operation index of each service at a plurality of moments in the current time period are obtained, carrying out normalization processing on the statistical characteristic value, the exponential weighted moving average value and the wavelet decomposition value. The expression for normalizing the ith operation index is as follows:

Feature(W_t, kpi_i) = minmax({mean(kpi_i), std(kpi_i), …, ewma(kpi_i), db2(kpi_i)});

wherein i = 1, …, n, and n is the number of operation indexes; mean(kpi_i), std(kpi_i), ewma(kpi_i) and db2(kpi_i) are respectively the mean value, the variance (denoted std in the formula), the exponential weighted moving average value and the wavelet decomposition value of the i-th operation index.
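As a non-limiting illustration, the fusion and normalization of a single operation index can be sketched in Python using only the standard library. Several points are assumptions of this sketch rather than statements of the embodiment: the smoothing factor 0.3, the use of the population standard deviation for the std term, taking the final value of the moving average as the fused value, and applying min-max scaling across the set of statistics (a literal reading of the formula). The db2 wavelet term is omitted because the embodiment does not state how the decomposition is reduced to a value.

```python
import statistics

def ewma_last(values, smoothing=0.3):
    """Final value of an exponentially weighted moving average over the samples."""
    average = values[0]
    for x in values[1:]:
        average = smoothing * x + (1 - smoothing) * average
    return average

def minmax(values):
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def kpi_feature(samples, smoothing=0.3):
    """Fuse the samples of one operation index in a period into a normalized feature."""
    raw = [
        statistics.fmean(samples),    # mean
        statistics.pstdev(samples),   # std (population standard deviation, assumed)
        max(samples),                 # maximum
        min(samples),                 # minimum
        ewma_last(samples, smoothing),
    ]
    return minmax(raw)
```

The local abnormal feature of a service would then be the collection of such vectors over all n operation indexes, for example concatenated into one list.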

On the basis of the foregoing embodiment, in this embodiment, the selecting a target service from all services of the business system according to the determination result of each service includes: if the service with the fault is one, taking the service with the fault as the target service; if the number of the services with faults is multiple, selecting the service executed firstly from the services with faults as a target service according to the execution sequence of the services with faults in the service system; and if all services have no fault, taking the service with the maximum abnormal score in the business system as a target service.

Specifically, if only one service with the abnormal score larger than the second preset threshold exists, the service with the abnormal score larger than the second preset threshold is used as a target service, which is the service level fault source of the service system.

If there are a plurality of services with abnormal scores larger than the second preset threshold, the service-level fault source of the business system lies among these services. According to the execution order of the faulty services in the call link, namely their upstream and downstream relation, priority is given to the service executed earlier, that is, to the upstream node, and that service is used as the target service.

And if the abnormal scores of all the services are not larger than a second preset threshold value, taking the service with the maximum abnormal score in the business system as a target service.
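As a non-limiting illustration, the three selection rules above can be sketched in Python. The sketch assumes that the call chain is available as a list ordered from upstream to downstream; a real call chain may be a graph, in which case an ordering would first have to be derived. The names `scores` and `call_order` are hypothetical.

```python
def select_target_service(scores, call_order, threshold=0.8):
    """Pick the target service from per-service abnormal scores.

    scores     -- mapping from service name to abnormal score
    call_order -- service names ordered from upstream to downstream (assumed)
    """
    faulty = [name for name in call_order if scores[name] > threshold]
    if faulty:
        # One faulty service: it is the target. Several: the most upstream wins.
        return faulty[0]
    # No score exceeds the threshold: fall back to the highest score.
    return max(call_order, key=lambda name: scores[name])
```

Because the fallback always returns a service, a target is still obtained when the threshold is set high.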

This embodiment uses the physical topology formed by the access relationships among services, namely the service call chain, and comprehensively considers both the hierarchical relationship between the services and the service system and the upstream and downstream relationship among the services, in order to locate the target service causing the service system fault. Fault location drills down level by level through the service system, the services and the instances in turn, while the upstream node is given priority within a single level, which effectively solves the problem of inaccurate fault location in scenarios that combine multiple levels with upstream and downstream dependencies.

In addition, when the abnormal evaluation threshold value is set to be higher, the target service causing the fault of the business system can be normally positioned, and the stability is better.

On the basis of the foregoing embodiments, in this embodiment, the determining whether each instance of the target service has a fault according to an operation index of each instance at each time in the current time period includes: fusing operation indexes of each instance at a plurality of moments in the current time period, fusing operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the target service at a plurality of moments in the current time period;

the method for fusing the operation indexes of each instance at multiple moments in the current time period, or in the previous time period of the current time period, is the same as the method for fusing the operation indexes of each service at multiple moments in the current time period.

Then, normalization processing is carried out on the fusion result of each operation index of each instance, and a set generated by the fusion result after normalization of all the operation indexes of each instance is used as the local abnormal feature of each instance.

Subtracting the fusion results of the operation indexes of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance;

wherein the abnormal score Score(S_m, W_t) of an instance S_m is calculated by the formula:

Score(S_m, W_t) = β × AnomalyDetection(S_m, W_t) + (1 - β) × Similarity{Feature(S_m, W_t), Feature(S, W_t)};

wherein β is a weight coefficient, which can be set according to actual requirements, such as 0.5; AnomalyDetection(S_m, W_t) is the local abnormal fluctuation score of the instance S_m between the current time period and the previous time period; and Similarity{Feature(S_m, W_t), Feature(S, W_t)} is the similarity between the local abnormal feature of the instance S_m in the current time period and the local abnormal feature of the target service S in the current time period.

And comparing the abnormal score of each instance with a third preset threshold, and judging whether each instance has a fault according to the comparison result.

After the abnormal score of each example is obtained, comparing the abnormal score of each example with a third preset threshold value;

if the instances with the abnormal scores larger than the third preset threshold exist, sequencing the instances with the faults according to the sequence of the abnormal scores from high to low, and taking the sequenced instances as fault positioning results of the service system.

And if the abnormal score of no instance of the target service is greater than the third preset threshold, taking all the instances of the target service as the fault positioning result of the service system, namely, the fault of the service system comes from the target service itself.

The third preset threshold may be set according to actual requirements, such as 0.8.
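As a non-limiting illustration, the instance-level decision can be sketched in Python. The mapping `instance_scores` from instance name to abnormal score is a hypothetical input, assumed to have been computed with the formula for Score(S_m, W_t).

```python
def locate_faulty_instances(instance_scores, threshold=0.8):
    """Return the fault location result for one target service.

    Instances scoring above the threshold are returned from highest to
    lowest score. If none exceeds it, every instance is returned, which
    attributes the fault to the target service as a whole.
    """
    faulty = {name: s for name, s in instance_scores.items() if s > threshold}
    if faulty:
        return sorted(faulty, key=faulty.get, reverse=True)
    return list(instance_scores)
```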

In the embodiment, local abnormal feature extraction is performed on each operation index; and for each instance, acquiring a local abnormal fluctuation score by using the local abnormal features of the running indexes of the instance, and evaluating a similarity score by using the similarity of the instance and the local abnormal features of the upper-layer service, wherein the abnormal score of the instance consists of the local abnormal fluctuation score and the similarity evaluation score according to an allocation weight. The method for acquiring the abnormal score has better accuracy, can be better suitable for a load balancing scene, and fully considers the influence between the upper hierarchical relation and the lower hierarchical relation.

The service system fault location device provided by the present invention is described below, and the service system fault location device described below and the service system fault location method described above may be referred to in a corresponding manner.

As shown in fig. 4, the apparatus for locating a fault in a service system provided in this embodiment includes a first determining module 401, a second determining module 402, a selecting module 403, and a fault locating module 404, where:

the first judging module 401 is configured to judge whether a fault exists in a service system according to an operation index of the service system in a current time period;

The current time period includes a plurality of time instants, and the number of time instants in the current time period is not specifically limited in the present embodiment.

The second determining module 402 is configured to determine whether each service of the service system has a fault according to an operation index of each service in the current time period if the service system has the fault;

wherein the service of the business system is a code for implementing a certain function.

Optionally, the operation indexes of the services of the service system at multiple times in the current time period are obtained, and then the operation indexes at multiple times in the current time period are combined to judge whether each service has a fault.

The selection module 403 is configured to select a target service from all services of the service system according to a determination result of each service;

When there are a plurality of failed services in the service system, the plurality of failed services may be targeted, or one or more services may be selected from the plurality of failed services. And then, according to fault positioning of the instances under the target service, further acquiring the instances causing the target service fault, and further accurately positioning the source causing the service system fault.

The fault location module 404 is configured to determine whether each instance of the target service has a fault according to an operation index of each instance in the current time period, and obtain a fault location result of the service system according to a determination result of each instance.

Optionally, the operation indexes of the instances at multiple times in the current time period are combined to determine whether the instances have faults.

When no instance of the target service has a fault, all the instances of the target service are used as the fault location result of the business system, namely, the root cause of the fault of the business system is a service-level fault generated by the target service.
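As a non-limiting illustration, the cooperation of the four modules can be sketched in Python as a single drill-down function. The sketch assumes that the system-level judgment and the per-service abnormal scores are already available, that `call_order` lists services from upstream to downstream, and that `score_instances` is a hypothetical callable returning the abnormal scores of the instances of one service. Returning `None` when the business system has no fault is also an assumption.

```python
def locate_fault(system_is_faulty, service_scores, call_order, score_instances,
                 second_threshold=0.8, third_threshold=0.8):
    """Drill down from business system to service to instance."""
    if not system_is_faulty:
        return None  # first judging module found no fault; nothing to locate

    # Second judging module and selection module.
    faulty = [s for s in call_order if service_scores[s] > second_threshold]
    target = faulty[0] if faulty else max(call_order, key=service_scores.get)

    # Fault locating module: only the instances of the target service are scored.
    instance_scores = score_instances(target)
    flagged = sorted(
        (name for name, score in instance_scores.items() if score > third_threshold),
        key=instance_scores.get,
        reverse=True,
    )
    return target, flagged or list(instance_scores)
```

In this sketch `score_instances` is called exactly once, for the target service, which reflects the stated reduction in the amount of calculation compared with scoring every instance of every service.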

On the basis of the foregoing embodiment, in this embodiment, the first determining module is specifically configured to: inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system; comparing the anomaly score with a first preset threshold; judging whether the service system has a fault according to the comparison result; the abnormal detection model takes the operation index of the business system in a historical period as a sample, and takes the abnormal score corresponding to the sample as a sample label for training and obtaining.

On the basis of the above embodiment, the embodiment further includes a calculating module specifically configured to: inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system; subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value; and taking the maximum value in the absolute value result as the abnormal score of the business system.
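As a non-limiting illustration, the reconstruction-based score of the business system can be sketched in Python. The trained abnormality detection model is represented by a hypothetical callable `reconstruct`; the median-based stand-in below exists only so that the sketch runs and is not the model of the embodiment. Treating a score greater than the first preset threshold as a fault is an assumption by analogy with the second and third thresholds.

```python
import statistics

def system_anomaly_score(observed, reconstruct):
    """Largest absolute gap between the reconstructed and the observed indexes."""
    rebuilt = reconstruct(observed)
    return max(abs(r - o) for r, o in zip(rebuilt, observed))

def system_has_fault(observed, reconstruct, first_threshold):
    """Compare the abnormal score with the first preset threshold."""
    return system_anomaly_score(observed, reconstruct) > first_threshold

def median_reconstruct(observed):
    """Stand-in for a trained model: rebuilds every point as the median."""
    centre = statistics.median(observed)
    return [centre for _ in observed]
```

For example, with the stand-in model the input `[1.0, 1.0, 5.0, 1.0]` is rebuilt as four values of 1.0, so the single spike yields a score of 4.0.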

On the basis of the foregoing embodiments, the second determining module in this embodiment is specifically configured to: fusing the operation indexes of each service at a plurality of moments in the current time period, fusing the operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the service system at a plurality of moments in the current time period; subtracting the fusion results of the operation indexes of each service in the current time interval and the previous time interval of the current time interval, and calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service; and comparing the abnormal score of each service with a second preset threshold, and judging whether each service has a fault according to the comparison result.

On the basis of the above embodiment, the present embodiment further includes a fusion module specifically configured to: and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each operation index of each service at a plurality of moments in the current time period.

On the basis of the above embodiment, the selection module in this embodiment is specifically configured to: if the service with the fault is one, taking the service with the fault as the target service; if the number of the services with faults is multiple, selecting the service executed firstly from the services with faults as a target service according to the execution sequence of the services with faults in the service system; and if all services have no fault, taking the service with the maximum abnormal score in the business system as a target service.

On the basis of the foregoing embodiments, the fault locating module in this embodiment is specifically configured to: fusing the operation indexes of each instance at a plurality of moments in the current time period, fusing the operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the target service at a plurality of moments in the current time period; subtracting the fusion results of the operation indexes of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period; taking the weighted sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance; and comparing the abnormal score of each instance with a third preset threshold, and judging whether each instance has a fault according to a comparison result.

Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a business system fault location method comprising: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer readable storage medium, the computer program includes program instructions, when the program instructions are executed by a computer, the computer can execute the service system fault location method provided by the above methods, the method includes: judging whether the service system has a fault or not according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

In yet another aspect, the present invention also provides a non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, performs the service system fault location method provided above, the method including: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the part thereof that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for locating a fault of a service system is characterized by comprising the following steps:

2. The method of claim 1, wherein the determining whether the service system has a fault according to the operation index of the service system in the current time period comprises:

comparing the abnormality score with a first preset threshold;

3. The method for locating faults of a business system according to claim 2, wherein the step of inputting the operation indexes of the business system in the current time period into an abnormality detection model and outputting the abnormality scores of the business system comprises the steps of:

subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value;

4. The method for locating faults in a service system according to any one of claims 1 to 3, wherein the determining whether faults exist in each service according to the operation index of each service of the service system in the current time period includes:

5. The method according to claim 4, wherein the fusing the operation indexes of each service at a plurality of times in the current time period comprises:

and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each operation index of each service at a plurality of moments in the current time period.

6. The method according to claim 4, wherein the selecting a target service from all services of the service system according to the determination result of each service comprises:

7. The method for locating faults in a business system according to any one of claims 1 to 3, wherein the step of judging whether faults exist in each instance according to the operation index of each instance of the target service at each moment in the current time period comprises the following steps:

subtracting the fusion result of each operation index of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value;

and comparing the abnormal score of each example with a third preset threshold, and judging whether each example has a fault according to the comparison result.

8. A service system fault location device, comprising:

the selection module is used for selecting a target service from all services of the business system according to the judgment result of each service;

and the fault positioning module is used for judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring the fault positioning result of the service system according to the judgment result of each instance.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the business system fault location method according to any one of claims 1 to 7.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the business system fault localization method according to any one of claims 1 to 7.