CN115408182A - Service system fault positioning method and device - Google Patents


Info

Publication number
CN115408182A
Authority
CN
China
Prior art keywords
service
fault
current time
time period
service system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110587616.3A
Other languages
Chinese (zh)
Inventor
张玲
朱丹
徐海勇
舒敏根
廖丽玲
梁恩磊
郭艺娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Information Technology Co Ltd
Priority to CN202110587616.3A
Publication of CN115408182A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention provides a method and a device for positioning a fault of a service system, wherein the method comprises the following steps: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance. On one hand, the invention realizes that only the target service instance with faults is subjected to fault detection, thereby greatly reducing the calculated amount and effectively improving the positioning efficiency; on the other hand, the fault reason of the service system is positioned layer by layer according to the hierarchical relation, so that the positioning result is more accurate.

Description

Service system fault positioning method and device
Technical Field
The present invention relates to the field of fault diagnosis technologies, and in particular, to a method and an apparatus for locating a fault in a service system.
Background
With the rapid development of computer technology, data information is more and more abundant. In order to obtain valuable information from abundant data information, a business system is adopted to process a large amount of data information. However, as the amount of data increases, the data structure becomes more and more complex, and the business logic between data becomes more and more complex, so that the business system is prone to failure. In order to timely process the fault problem of the service system, the fault source of the service system needs to be located so as to timely process the fault source.
In the prior art, the indexes of all instances in a business system are usually monitored, an abnormality evaluation score is calculated for each instance, and the fault source of the business system is determined from the instances whose abnormality evaluation scores meet a preset abnormality condition.
However, each business system typically contains many independent services (i.e., code that implements certain functionality), each of which deploys many instances. Each business system therefore contains a large number of instances. Locating a business system fault in this way requires calculating the abnormality evaluation scores of all instances in the business system in real time, so the amount of calculation is large and the positioning efficiency is low.
Disclosure of Invention
The invention provides a method and a device for positioning a fault of a service system, which address the large amount of calculation and low positioning efficiency caused in the prior art by calculating the abnormality evaluation scores of all instances in the service system in real time, and thereby improve the positioning efficiency.
The invention provides a method for positioning a fault of a service system, which comprises the following steps:
judging whether the service system has a fault according to the operation index of the service system in the current time period;
if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period;
selecting a target service from all services of the service system according to the judgment result of each service;
and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
According to the method for positioning the fault of the service system provided by the invention, the step of judging whether the service system has the fault or not according to the operation index of the service system in the current time period comprises the following steps:
inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system;
comparing the anomaly score with a first preset threshold;
judging whether the service system has a fault according to the comparison result;
wherein the anomaly detection model is obtained by training with the operation index of the business system in a historical period as a sample and the abnormal score corresponding to the sample as the sample label.
According to the method for positioning the fault of the service system provided by the invention, the step of inputting the operation index of the service system in the current time interval into an abnormality detection model and outputting the abnormality score of the service system comprises the following steps:
inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system;
subtracting the operation indexes of the service system after reconstruction and before reconstruction, and then calculating an absolute value;
and taking the maximum value in the absolute value result as the abnormal score of the business system.
According to the method for locating the fault of the service system provided by the invention, the step of judging whether the fault exists in each service according to the operation index of each service of the service system in the current time period comprises the following steps:
fusing the operation indexes of each service at a plurality of moments in the current time period, fusing the operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the service system at a plurality of moments in the current time period;
subtracting the fusion results of the operation indexes of each service in the current time interval and the previous time interval of the current time interval, and calculating an absolute value;
calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period;
taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service;
and comparing the abnormal score of each service with a second preset threshold value, and judging whether each service has a fault according to the comparison result.
According to the method for positioning the fault of the service system provided by the invention, the fusion of the operation indexes of each service at a plurality of moments in the current time period comprises the following steps:
and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each service operation index at a plurality of moments in the current time period.
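As an illustration of one of the fusion options listed above, the following minimal sketch computes an exponentially weighted moving average over the values of one operation index within a time period. The function name `ewma`, the smoothing factor `alpha`, and the sample numbers are assumptions made for illustration; the source does not specify them.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average of a sequence.

    `alpha` is a hypothetical smoothing factor (not given in the source).
    The final smoothed value is returned and used as the fused result.
    """
    if not values:
        raise ValueError("values must not be empty")
    smoothed = values[0]
    for v in values[1:]:
        # Newer points get weight alpha, the running average gets 1 - alpha.
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed


# Delay of one service at several moments in the current time period
# (made-up numbers for illustration only).
latency = [100.0, 100.0, 100.0, 200.0]
fused_latency = ewma(latency, alpha=0.5)
```

A larger `alpha` makes the fused value react faster to the most recent moments, which may suit short time periods.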
According to the method for positioning the fault of the business system provided by the invention, the step of selecting the target service from all the services of the business system according to the judgment result of each service comprises the following steps:
if the service with the fault is one, taking the service with the fault as the target service;
if the service with the fault is multiple, selecting the service which is executed firstly from the service with the fault as a target service according to the execution sequence of the service with the fault in the service system;
and if all services have no fault, taking the service with the maximum abnormal score in the business system as a target service.
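The three selection rules above can be sketched as follows. This is a minimal illustration, assuming the per-service abnormal scores are already available as a dictionary and the execution order is given as a list; the function and parameter names are hypothetical.

```python
def select_target_service(scores, threshold, execution_order):
    """Pick the target service from per-service abnormal scores.

    scores: dict mapping service name -> abnormal score
    threshold: the second preset threshold
    execution_order: list of service names, earliest-executed first
    """
    faulty = [s for s in execution_order if scores[s] > threshold]
    if faulty:
        # One faulty service: it is the target. Several faulty services:
        # the one executed first, i.e. the first entry in execution order.
        return faulty[0]
    # No service exceeds the threshold: fall back to the highest score.
    return max(execution_order, key=lambda s: scores[s])


# Made-up scores for illustration only.
order = ["service1", "service2", "service3"]
target = select_target_service(
    {"service1": 0.3, "service2": 0.9, "service3": 0.85}, 0.8, order
)
```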
According to the method for locating the fault of the business system provided by the invention, the step of judging whether the fault exists in each instance according to the operation index of each instance of the target service at each moment in the current time interval comprises the following steps:
fusing operation indexes of each instance at a plurality of moments in the current time period, fusing operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the target service at a plurality of moments in the current time period;
subtracting the fusion results of the operation indexes of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value;
calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period;
taking the sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance;
and comparing the abnormal score of each instance with a third preset threshold, and judging whether each instance has a fault according to the comparison result.
The invention also provides a device for positioning the fault of the service system, which comprises:
the first judgment module is used for judging whether the service system has a fault or not according to the operation index of the service system in the current time period;
the second judgment module is used for judging whether each service has a fault according to the operation index of each service of the service system in the current time period if the service system has the fault;
the selection module is used for selecting a target service from all the services of the business system according to the judgment result of each service;
and the fault positioning module is used for judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the service system fault location methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the business system fault locating method as described in any of the above.
According to the method and the device for positioning the fault of the business system, the fault detection is carried out on the business system in real time, the fault detection is carried out on each service of the business system under the condition that the fault exists in the business system, so that the target service is determined, then the fault detection is carried out on each instance of the target service, and further the fault reason of the business system is positioned, so that on one hand, the fault detection is carried out only on the instance of the target service with the fault, the calculated amount is greatly reduced, and the positioning efficiency is effectively improved; on the other hand, the fault reasons of the service system are positioned layer by layer according to the hierarchical relationship, so that the positioning result is more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for locating a fault in a service system according to the present invention;
fig. 2 is a schematic structural diagram of a service topology in the service system fault location method provided in the present invention;
FIG. 3 is a second flowchart of the method for locating a fault in a service system according to the present invention;
FIG. 4 is a schematic structural diagram of a service system fault location device provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for locating a fault in a service system according to the present invention is described below with reference to FIG. 1. The method includes: Step 101, judging whether a service system has a fault according to an operation index of the service system in a current time period;
the service system may be a data traffic package order service system or a take-out service system or other service systems having a multi-level calling relationship.
The number of the service systems may be one or more, and this embodiment does not specifically limit this.
The current time period includes a plurality of time instants, and the present embodiment does not specifically limit the number of time instants in the current time period.
The operation index of the service system is a plurality of performance indexes representing the overall operation state of the service system, including the service volume, the success rate and the delay of the service system, and the content of the operation index of the service system is not specifically limited in this embodiment.
Optionally, the traffic volume, the success rate, and the delay of the service system may be obtained by timed monitoring through a timing monitoring device, or by monitoring the overall operation condition of the service system according to a preset period, which is not specifically limited in this embodiment. The preset period can be set according to actual requirements, such as one minute.
Whether the service system has a fault is judged by combining the operation indexes of the service system at a plurality of moments in the current time period. If the service system has no fault, the operation indexes of the service system at a plurality of moments in the next time period are monitored and the judgment is repeated.
If the service system has a fault, layer-by-layer fault location is performed on the components of the service system, and the root-cause instance that caused the fault of the service system is located automatically.
Step 102, if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period;
wherein, the service of the business system is a code for realizing a certain function.
The operation index of each service is the operation index generated in the process that the service responds to other services or the operation index generated in the process that each service requests other services. Wherein the operation index of the service comprises the traffic volume, the success rate and the delay of the service.
As shown in fig. 2, the business system A includes a plurality of independent services, i.e., service 1, service 2 through service n, and each service has a plurality of instances deployed under it. There is a business request and response relation between service 1 and service 2. The operation index generated in the process of service 2 requesting service 3 may be used as the operation index of service 2, or the operation index generated in the process of service 2 responding to service 1 may be used as the operation index of service 2.
Here, sec1 is the process in which service 1 requests service 2, and SecX is the process in which service 2 returns its result as a response to service 1.
And under the condition that the service system is determined to have faults, judging each service of the service system, and positioning the service causing the service system faults in the services of the service system according to the judgment result.
Optionally, the operation indexes of each service of the service system at multiple times in the current time period are obtained, and then whether each service has a fault is determined by combining the operation indexes at multiple times in the current time period.
Step 103, selecting a target service from all services of the business system according to the judgment result of each service;
the judgment result of each service includes two conditions of existence of fault or nonexistence of fault.
When selecting a target service, determining the number of services with faults in a service system;
when exactly one service with a fault exists in the service system, that service is selected from all the services of the service system as the target service; wherein the target service is the service causing the failure of the business system.
When there are a plurality of failed services in the service system, all of the failed services may be taken as target services, or one or more services may be selected from them. Then, fault location is performed on the instances under the target service to obtain the instance causing the fault of the target service, so that the source of the business system fault is located accurately.
And step 104, judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
Specifically, once the target service is obtained, each instance of the target service is judged, and the instance causing the fault of the target service is located among the instances of the target service according to the judgment result.
Optionally, the running indexes of the instances at multiple moments in the current time period are combined to determine whether the instances have faults.
When the instance in the target service has a fault, taking the instance with the fault as a fault positioning result of the business system;
when no fault exists in the instances of the target service, all the instances of the target service are used as the fault positioning result of the business system, namely, the root of the fault of the business system is the service level fault generated by the target service.
According to the implementation, firstly, fault detection is carried out on a business system in real time, fault detection is carried out on each service of the business system under the condition that the business system has faults so as to determine target services, then fault detection is carried out on each instance of the target services, and further the fault reason of the business system is positioned, on one hand, fault detection is carried out only on the instances of the target services with the faults, so that the calculated amount is greatly reduced, and the positioning efficiency is effectively improved; on the other hand, the fault reason of the service system is positioned layer by layer according to the hierarchical relation, so that the positioning result is more accurate.
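The overall flow of steps 101 to 104 can be sketched as below. This is an illustrative outline only: the scoring functions are passed in as hypothetical callables, the thresholds and scores are made-up values, and where several services are faulty the sketch picks the first one in execution order, which is one of the options the text allows. Instance scores are requested only for the target service, which is where the reduction in calculation comes from.

```python
def locate_fault(system_score, score_services, score_instances,
                 th_system, th_service, th_instance):
    """Layer-by-layer fault location: system -> service -> instance.

    score_services: hypothetical callable returning {service: score},
        with services listed in execution order
    score_instances: hypothetical callable taking a service name and
        returning {instance: score} for that service only
    Returns None if the system is judged healthy, otherwise a tuple
    (target_service, located_instances).
    """
    # Step 101: judge the business system as a whole.
    if system_score <= th_system:
        return None
    # Step 102: judge each service.
    service_scores = score_services()
    faulty = [s for s, sc in service_scores.items() if sc > th_service]
    # Step 103: first faulty service in execution order, or, if none
    # exceeds the threshold, the service with the highest score.
    target = faulty[0] if faulty else max(service_scores, key=service_scores.get)
    # Step 104: score only the instances of the target service.
    instance_scores = score_instances(target)
    bad = [i for i, sc in instance_scores.items() if sc > th_instance]
    # No faulty instance is treated as a service-level fault, so all
    # instances of the target service are reported.
    return target, (bad if bad else list(instance_scores))


# Made-up scores for illustration only.
services = {"service1": 0.2, "service2": 0.95, "service3": 0.4}
instances = {
    "service1": {"s1-i1": 0.1},
    "service2": {"s2-i1": 0.1, "s2-i2": 0.9},
    "service3": {"s3-i1": 0.3},
}
scored = []  # records which services had their instances scored


def score_instances(service):
    scored.append(service)
    return instances[service]


result = locate_fault(0.9, lambda: services, score_instances, 0.6, 0.8, 0.8)
```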
On the basis of the foregoing embodiment, in this example, the determining whether the service system has a fault according to the operation index of the service system in the current time period includes: inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system; comparing the anomaly score with a first preset threshold; judging whether the service system has a fault according to the comparison result; the abnormal detection model takes the operation index of the business system in a historical period as a sample, and takes the abnormal score corresponding to the sample as a sample label for training and obtaining.
The anomaly detection model may be an autoencoder model, a VAE (Variational Autoencoder) model, or another anomaly detection model, which is not specifically limited in this embodiment.
The structure of the anomaly detection model can be set according to actual requirements, such as the structures of an encoder and a decoder in the VAE model, and a loss function and the like can be set according to actual requirements. An example of a VAE model is shown in Table 1.
TABLE 1 VAE model Structure
[Table 1 (VAE model structure) is provided as an image in the original publication and is not reproduced here.]
As shown in fig. 3, before obtaining the abnormal score of the business system, the abnormal detection model is trained. Firstly, the operation index of the business system in the historical period is used as a sample, and the abnormal score of the business system in the historical period is used as a sample label. And carrying out normalization processing on the samples, inputting the normalized samples and the sample labels into an anomaly detection model, and training the anomaly detection model.
Each historical period may be characterized by a sliding window. The sample $X_t$ for any historical period can be expressed as:
$X_t = \mathrm{minmax}(\mathrm{Traffic}_t \cup \mathrm{SuccessRate}_t \cup \mathrm{Latency}_t)$;
$\mathrm{Traffic}_t = \{\mathrm{traffic}_{t-L+1}, \ldots, \mathrm{traffic}_{t-1}, \mathrm{traffic}_t\}$;
$\mathrm{SuccessRate}_t = \{\mathrm{success\_rate}_{t-L+1}, \ldots, \mathrm{success\_rate}_{t-1}, \mathrm{success\_rate}_t\}$;
$\mathrm{Latency}_t = \{\mathrm{latency}_{t-L+1}, \ldots, \mathrm{latency}_{t-1}, \mathrm{latency}_t\}$;
where $\mathrm{minmax}(\cdot)$ is min-max normalization; $\mathrm{Traffic}_t$, $\mathrm{SuccessRate}_t$ and $\mathrm{Latency}_t$ are respectively the traffic volume, success rate and delay sequences of the business system at time t; and L is the length of the sliding window, i.e. the number of moments within the historical period. The length of the sliding window can be set according to actual requirements, such as $L = 30$.
The sample set TrainData can be represented as:
$\mathrm{TrainData} = \{X_{t-\mathrm{trainnum}+1}, \ldots, X_{t-1}, X_t\}$;
where trainnum is the number of training sample points and can be set according to actual requirements, for example $\mathrm{trainnum} \geq 4320$.
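The sliding-window sample construction can be sketched as follows. This is a minimal sketch under stated assumptions: min-max normalization is applied to each index separately within the window before concatenation, which is only one possible reading of the formula for $X_t$, and the function names and input numbers are hypothetical.

```python
def minmax(values):
    """Scale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_sample(traffic, success_rate, latency, t, L):
    """Build X_t from the L most recent points ending at index t."""
    start = t - L + 1
    if start < 0:
        raise ValueError("not enough history for a window of length L")
    window = []
    for series in (traffic, success_rate, latency):
        # Assumption: each index is normalized on its own window slice.
        window.extend(minmax(series[start:t + 1]))
    return window


def build_train_data(traffic, success_rate, latency, t, L, trainnum):
    """TrainData = {X_(t-trainnum+1), ..., X_t}."""
    return [build_sample(traffic, success_rate, latency, k, L)
            for k in range(t - trainnum + 1, t + 1)]


# Made-up monitoring data for illustration only.
traffic = [1, 2, 3, 4, 5]
success_rate = [1.0, 1.0, 0.5, 1.0, 1.0]
latency = [10, 20, 30, 40, 50]
x_t = build_sample(traffic, success_rate, latency, t=4, L=3)
train_data = build_train_data(traffic, success_rate, latency, t=4, L=3, trainnum=2)
```

With $L = 30$ and three indexes, each sample built this way would have 90 elements.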
After the anomaly detection model is trained, the operation index of the business system in the current time period is normalized, the normalized operation index is input into the trained anomaly detection model, and the abnormal score $S_t(A)$ of the business system at the current time t in the current time period is output.
The value range of the abnormal score at the current time is [0, 1]. The larger the abnormal score, the more likely it is that the business system has a fault at the current time.
Then, the abnormal score $S_t(A)$ of the business system at the current time is compared with the first preset threshold $th_A$. If $S_t(A) > th_A$, the business system is judged to have a fault at the current time t, and layer-by-layer fault location is performed on the components of the business system at the current time t, from the business system to the services and then to the instances.
The first preset threshold may be set according to actual requirements, such as a value between 0.5 and 0.8.
In the embodiment, the anomaly detection model is adopted, so that whether the service system has faults or not can be accurately monitored in real time.
On the basis of the foregoing embodiment, in this embodiment, the inputting the operation index of the business system in the current time period into an anomaly detection model, and outputting the anomaly score of the business system includes: inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system; subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value; and taking the maximum value in the absolute value result as the abnormal score of the business system.
Specifically, the step of obtaining the abnormal score of the business system by using the abnormal detection model includes, firstly, reconstructing an operation index of the business system in a current time period by using the abnormal detection model, and outputting the reconstructed operation index.
Then, the difference between the operation indexes before and after reconstruction is calculated, the absolute value of the difference is taken, and the maximum value in the absolute value result is used as the abnormal score $S_t(A)$ of the business system at the current time.
The abnormal score $S_t(A)$ is calculated as follows:
$S_t(A) = \max(|\mathrm{TestOutput} - \mathrm{TestData}|)$;
where $\max(\cdot)$ is the maximum operation, and TestData and TestOutput are the operation indexes of the business system before and after reconstruction, respectively.
The worse the reconstruction, the higher the probability that the business system has a fault.
According to the embodiment, the abnormal score of the service system is calculated directly according to the reconstruction result, and whether the service system has a fault can be determined quickly and accurately.
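The score calculation and threshold comparison can be sketched as follows. The reconstruction itself (for example by a trained VAE) is outside this sketch: `test_output` is assumed to be supplied by the model. The function names and the default threshold of 0.6 are assumptions; the source only states that the first preset threshold may be a value between 0.5 and 0.8.

```python
def anomaly_score(test_data, test_output):
    """S_t(A) = max(|TestOutput - TestData|) over all elements."""
    if len(test_data) != len(test_output):
        raise ValueError("input and reconstruction must have equal length")
    return max(abs(o - d) for o, d in zip(test_output, test_data))


def system_has_fault(test_data, test_output, th_a=0.6):
    """Fault if the score exceeds the first preset threshold th_A."""
    return anomaly_score(test_data, test_output) > th_a


# Normalized operation indexes before and after reconstruction
# (made-up numbers for illustration only).
test_data = [0.25, 0.5, 1.0]
test_output = [0.25, 0.75, 0.125]
score = anomaly_score(test_data, test_output)
faulty = system_has_fault(test_data, test_output)
```

Because the inputs are normalized to [0, 1] and the score is a maximum absolute difference, the score stays within [0, 1] as long as the reconstruction is also bounded to that range.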
On the basis of the foregoing embodiments, in this embodiment, the determining whether each service has a fault according to the operation index of each service of the service system in the current time period includes: fusing operation indexes of each service at a plurality of moments in the current time period, fusing operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the service system at a plurality of moments in the current time period;
specifically, when it is determined that the service system has a fault at the current time t, fault location is performed on each service of the service system.
The operation indexes of the business system A and the service S include KPIs (Key Performance Indicators) such as traffic volume, success rate and delay, and each operation index is denoted by $kpi_i$, $i \in [1, n]$.
The manner of fusing any operation index of each service at multiple times in the current time period may be to calculate a mean value, a variance, and the like of the operation indexes at all times in the current time period, which is not limited in this embodiment.
Similarly, the method for fusing the operation indexes of each service at a plurality of moments in the previous period of the current period and the method for fusing the operation indexes of the service system at a plurality of moments in the current period are the same as the above fusion method.
Then, the fusion result of each operation index of the business system or the service is normalized, and a set generated by the fusion result after the normalization of all the operation indexes of the business system or the service is used as the local abnormal feature of the business system or the service.
The local abnormal features of the service system A are expressed as follows:
$\mathrm{Feature}(A, W_t) = \{\mathrm{Feature}(W_{t,kpi_1}) \cup \mathrm{Feature}(W_{t,kpi_2}) \cup \cdots \cup \mathrm{Feature}(W_{t,kpi_n})\}$;
where $\mathrm{Feature}(W_{t,kpi_1})$ is the fusion result of the first KPI operation index, and $\mathrm{Feature}(A, W_t)$ is the local abnormal feature of business system A in the current time period $W_t$.
And in the same way, taking the set generated by the normalized fusion result of all the operation indexes of each service as the local abnormal feature of each service.
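A local abnormal feature could be assembled along the following lines. This is a sketch under several assumptions that the source leaves open: fusion uses the mean only, the fused value is min-max scaled against a caller-supplied reference range per index, and the result is clipped to [0, 1]. All names and numbers are hypothetical.

```python
from statistics import fmean


def local_feature(kpi_windows, bounds):
    """Feature(., W_t): one normalized fused value per operation index.

    kpi_windows: dict kpi name -> values at the moments of the period
    bounds: dict kpi name -> (low, high), a hypothetical reference range
            used for min-max normalization of the fused value
    """
    feature = []
    for name in sorted(kpi_windows):          # fixed order for comparability
        fused = fmean(kpi_windows[name])      # fusion by mean (one option)
        low, high = bounds[name]
        scaled = (fused - low) / (high - low)
        feature.append(min(1.0, max(0.0, scaled)))  # clip to [0, 1]
    return feature


# Made-up values for one service in the current time period.
windows = {
    "latency": [100, 200, 300],
    "success_rate": [1.0, 0.5, 0.75],
    "traffic": [10, 20, 30],
}
bounds = {"latency": (0, 400), "success_rate": (0, 1), "traffic": (0, 40)}
feature = local_feature(windows, bounds)
```

Using the same index order and the same reference ranges for the business system and for each service keeps their feature vectors directly comparable.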
Subtracting the fusion result of each operation index of each service in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service;
The step of calculating the abnormal score of service S includes, first, calculating the absolute value of the difference between the local abnormal features of service S in the current time period $W_t$ and in the previous time period $W_{t-1}$, and then taking the maximum value in the absolute value result. The specific calculation formula is as follows:
$\mathrm{AnomalyDetection}(S, W_t) = \max(|\mathrm{Feature}(S, W_t) - \mathrm{Feature}(S, W_{t-1})|)$;
where $\mathrm{AnomalyDetection}(S, W_t)$ is the local abnormal fluctuation score of service S between the current time period and the previous time period, and $\mathrm{Feature}(S, W_t)$ is the local abnormal feature of service S in the current time period $W_t$.
And calculating the similarity between the local abnormal feature of the service S in the current time interval and the local abnormal feature of the service system A in the current time interval, wherein the specific calculation formula is as follows:
$$P(S, W_t) = \mathrm{Similarity}\{\mathrm{Feature}(S, W_t), \mathrm{Feature}(A, W_t)\};$$
the similarity may be calculated by using a pearson correlation coefficient or a cosine similarity, which is not specifically limited in this embodiment.
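For reference, the two similarity measures named above have the following standard definitions. Here $x$ and $y$ stand for the local abnormal features of the service and of the service system, flattened into vectors of equal length $d$; treating the features as ordered vectors is an assumption made for illustration, since the embodiment describes them as sets.

```latex
\cos(x, y) = \frac{\sum_{k=1}^{d} x_k y_k}{\sqrt{\sum_{k=1}^{d} x_k^2}\,\sqrt{\sum_{k=1}^{d} y_k^2}},
\qquad
\rho(x, y) = \frac{\sum_{k=1}^{d} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{d} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{d} (y_k - \bar{y})^2}}
```

Both measures lie in $[-1, 1]$. When every feature component is non-negative, as it is after min-max normalization, the cosine similarity lies in $[0, 1]$.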
Finally, the weighted sum of the maximum absolute value corresponding to service S and the similarity corresponding to service S is taken as the abnormal score of service S. The specific calculation formula is:
$$\mathrm{Score}(S, W_t) = \alpha \times \mathrm{AnomalyDetection}(S, W_t) + (1 - \alpha) \times P(S, W_t);$$
where $\alpha$ is a weight coefficient that can be set according to actual requirements, for example 0.5.
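The score calculation above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the local abnormal features are assumed to be flat lists of normalized values in a fixed order, and cosine similarity is chosen from the two options the embodiment allows.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def anomaly_detection(feature_now, feature_prev):
    """Largest absolute element-wise change between the current and previous period."""
    return max(abs(a - b) for a, b in zip(feature_now, feature_prev))


def anomaly_score(service_now, service_prev, system_now, alpha=0.5):
    """Weighted sum of the fluctuation score and the similarity to the system feature."""
    fluctuation = anomaly_detection(service_now, service_prev)
    similarity = cosine_similarity(service_now, system_now)
    return alpha * fluctuation + (1 - alpha) * similarity


# A service whose feature jumped and now matches the system feature scores high.
score = anomaly_score([1.0, 0.0], [0.0, 0.0], [1.0, 0.0])
```

With alpha = 0.5, a service needs both a large fluctuation and a high similarity to the system-level feature to score close to 1.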
After the abnormal score of each service is obtained, it is compared with a second preset threshold, and whether each service has a fault is judged according to the comparison result.
If the abnormal score of a service is greater than the second preset threshold, the service has a fault; if it is less than or equal to the second preset threshold, the service has no fault.
The second preset threshold may be set according to actual requirements, such as 0.8.
When locating the service-level fault source of the service system, this embodiment considers both the local abnormal fluctuation score of each service's operation indexes between the current time period and the previous time period, and the correlation between each service's operation indexes and those of the service system in the current time period. Fault locating accuracy is therefore high in fault drill-down scenarios with hierarchical relationships.
In addition, the abnormal score of each service is calculated directly from the operation indexes, which avoids the resource cost of training a model on a large amount of historical data in order to compute the abnormal score. The calculation process is simple and the amount of calculation is small, so the fault locating efficiency is high.
On the basis of the foregoing embodiment, in this embodiment, the fusing the operation indexes of each service at multiple times in the current time period includes: and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each service operation index at a plurality of moments in the current time period.
The statistical characteristic values include a mean value, a variance, a maximum value, and a minimum value, which is not specifically limited in this embodiment.
The wavelet decomposition value may be a db2 wavelet decomposition value or the like, and this embodiment is not particularly limited thereto.
After the statistical characteristic values, the exponentially weighted moving average, and the wavelet decomposition value of each operation index of each service at a plurality of moments in the current time period are obtained, they are normalized. The expression for normalizing the i-th operation index is:
$$\mathrm{Feature}(W_{t,\mathrm{kpi}_i}) = \mathrm{minmax}\left(\{\mathrm{mean}(\mathrm{kpi}_i), \mathrm{std}(\mathrm{kpi}_i), \dots, \mathrm{ewma}(\mathrm{kpi}_i), \mathrm{db2}(\mathrm{kpi}_i)\}\right);$$
where $i = 1, \dots, n$, and $n$ is the number of operation indexes; $\mathrm{mean}(\mathrm{kpi}_i)$, $\mathrm{std}(\mathrm{kpi}_i)$, $\mathrm{ewma}(\mathrm{kpi}_i)$, and $\mathrm{db2}(\mathrm{kpi}_i)$ are respectively the mean, the variance, the exponentially weighted moving average, and the wavelet decomposition value of the i-th operation index.
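The fusion and normalization of one operation index can be sketched as below. Several points are assumptions made for illustration: the function names and the smoothing factor are hypothetical, `std` is computed as the population standard deviation, `minmax` is read as min-max scaling across the fused values of a single index, and the db2 wavelet term is omitted so that the sketch needs only the standard library.

```python
import statistics


def ewma(values, smoothing=0.3):
    """Exponentially weighted moving average; returns the last smoothed value."""
    avg = values[0]
    for x in values[1:]:
        avg = smoothing * x + (1 - smoothing) * avg
    return avg


def min_max(values):
    """Scale values into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_kpi(samples):
    """Fuse the samples of one operation index over a time period."""
    raw = [
        statistics.fmean(samples),   # mean
        statistics.pstdev(samples),  # std (assumed: population standard deviation)
        max(samples),
        min(samples),
        ewma(samples),
    ]
    return min_max(raw)


def local_feature(kpis):
    """Concatenate the fused features of all indexes, in sorted key order."""
    feature = []
    for name in sorted(kpis):
        feature.extend(fuse_kpi(kpis[name]))
    return feature


feature = local_feature({"delay": [1, 2, 3, 4], "success_rate": [1, 1, 1, 1]})
```

Sorting the index names keeps the component order stable, so features from different time periods or different services can be compared element by element.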
On the basis of the foregoing embodiment, in this embodiment, the selecting a target service from all services of the business system according to the determination result of each service includes: if the service with the fault is one, taking the service with the fault as the target service; if the number of the services with faults is multiple, selecting the service executed firstly from the services with faults as a target service according to the execution sequence of the services with faults in the service system; and if all services have no fault, taking the service with the maximum abnormal score in the business system as a target service.
Specifically, if only one service with the abnormal score larger than the second preset threshold exists, the service with the abnormal score larger than the second preset threshold is used as a target service, which is the service level fault source of the service system.
If a plurality of services have abnormal scores greater than the second preset threshold, the service-level fault source of the business system is among these services. Priority is given according to the execution order of the faulty services in the call link, namely their upstream-downstream relationship: the upstream node takes priority, and the service executed earliest is taken as the target service.
And if the abnormal scores of all the services are not larger than a second preset threshold value, taking the service with the maximum abnormal score in the business system as a target service.
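The three selection rules above reduce to a short function. This is a sketch under stated assumptions: the function name is hypothetical, and the call chain is assumed to be available as a list of service names ordered from upstream to downstream.

```python
def select_target_service(scores, call_order, threshold=0.8):
    """Pick the target service from per-service abnormal scores.

    scores: dict mapping service name to abnormal score.
    call_order: service names ordered from upstream to downstream (assumed input).
    """
    # Faulty services, kept in upstream-first order.
    faulty = [s for s in call_order if scores[s] > threshold]
    if faulty:
        # One faulty service: it is the target. Several: the most upstream wins.
        return faulty[0]
    # No service exceeds the threshold: fall back to the highest score.
    return max(call_order, key=lambda s: scores[s])


order = ["gateway", "order", "payment"]
target = select_target_service({"gateway": 0.3, "order": 0.9, "payment": 0.95}, order)
```

In this example both "order" and "payment" exceed the threshold, and "order" is selected because it is further upstream.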
This embodiment uses the physical topology of access relationships among services, namely the service call chain, and comprehensively considers both the hierarchical relationship between the services and the service system and the upstream-downstream relationship among the services to locate the target service causing the service system fault. Fault locating drills down level by level through the service system, the service, and the instance, while giving priority to the upstream node within a single level, which effectively addresses inaccurate fault locating in scenarios with multiple levels and upstream-downstream dependencies.
In addition, even when the abnormal score threshold is set relatively high, the target service causing the service system fault can still be located, because the service with the maximum abnormal score is used when no service exceeds the threshold; the method is therefore more stable.
On the basis of the foregoing embodiments, in this embodiment, the determining whether each instance of the target service has a fault according to an operation index of each instance at each time in the current time period includes: fusing operation indexes of each instance at a plurality of moments in the current time period, fusing operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing operation indexes of the target service at a plurality of moments in the current time period;
The method for fusing the operation indexes of each instance at multiple moments in the current time period or in the previous time period is the same as the method for fusing the operation indexes of each service at multiple moments in the current time period.
Then, normalization processing is carried out on the fusion result of each operation index of each instance, and a set generated by the fusion result after normalization of all the operation indexes of each instance is used as the local abnormal feature of each instance.
Subtracting the fusion results of the operation indexes of each instance in the current time interval and the previous time interval of the current time interval, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance;
The abnormal score $\mathrm{Score}(S_m, W_t)$ of an instance $S_m$ is calculated as:
$$\mathrm{Score}(S_m, W_t) = \beta \times \mathrm{AnomalyDetection}(S_m, W_t) + (1 - \beta) \times \mathrm{Similarity}\{\mathrm{Feature}(S_m, W_t), \mathrm{Feature}(S, W_t)\};$$
where $\beta$ is a weight coefficient that can be set according to actual requirements, for example 0.5; $\mathrm{AnomalyDetection}(S_m, W_t)$ is the local abnormal fluctuation score of instance $S_m$ between the current time period and the previous time period; and $\mathrm{Similarity}\{\mathrm{Feature}(S_m, W_t), \mathrm{Feature}(S, W_t)\}$ is the similarity between the local abnormal feature of instance $S_m$ in the current time period and the local abnormal feature of the target service S in the current time period.
After the abnormal score of each instance is obtained, it is compared with a third preset threshold, and whether each instance has a fault is judged according to the comparison result.
If there are instances with abnormal scores greater than the third preset threshold, the faulty instances are sorted by abnormal score from high to low, and the sorted instances are taken as the fault locating result of the service system.
If the abnormal scores of all instances of the target service are not greater than the third preset threshold, all instances of the target service are taken as the fault locating result of the service system; that is, the fault of the service system comes from the target service.
The third preset threshold may be set according to actual requirements, such as 0.8.
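The instance-level decision can be sketched as follows. The function name and the dictionary input are hypothetical; the per-instance abnormal scores are assumed to have been computed already.

```python
def locate_faulty_instances(instance_scores, threshold=0.8):
    """Return the fault locating result for the instances of the target service.

    instance_scores: dict mapping instance name to abnormal score.
    """
    faulty = [name for name, score in instance_scores.items() if score > threshold]
    if faulty:
        # Rank faulty instances from the highest abnormal score to the lowest.
        return sorted(faulty, key=lambda name: instance_scores[name], reverse=True)
    # No instance exceeds the threshold: report every instance of the target
    # service, i.e. the fault is attributed to the service as a whole.
    return list(instance_scores)


ranked = locate_faulty_instances({"order-1": 0.85, "order-2": 0.95, "order-3": 0.2})
```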
In this embodiment, local abnormal features are extracted from each operation index. For each instance, a local abnormal fluctuation score is obtained from the local abnormal features of the instance's operation indexes, and a similarity score is obtained from the similarity between the local abnormal features of the instance and those of its upper-level service; the abnormal score of the instance combines the local abnormal fluctuation score and the similarity score according to the assigned weights. This way of obtaining the abnormal score is more accurate, is better suited to load-balancing scenarios, and fully considers the influence between upper and lower levels of the hierarchy.
The service system fault location device provided by the present invention is described below, and the service system fault location device described below and the service system fault location method described above may be referred to in a corresponding manner.
As shown in fig. 4, the apparatus for locating a fault in a service system provided in this embodiment includes a first determining module 401, a second determining module 402, a selecting module 403, and a fault locating module 404, where:
the first determining module 401 is configured to determine whether a fault exists in a service system according to an operation index of the service system in a current time period;
the service system may be a data traffic package order service system or a take-out service system or other service systems having a multi-level calling relationship.
The number of the service systems may be one or more, and this embodiment does not specifically limit this.
The current time period includes a plurality of time instants, and the number of time instants in the current time period is not specifically limited in the present embodiment.
The operation index of the service system is a plurality of performance indexes representing the overall operation state of the service system, including the service volume, the success rate and the delay of the service system, and the content of the operation index of the service system is not specifically limited in this embodiment.
Optionally, the traffic volume, the success rate, and the delay of the service system may be obtained by a timing monitoring device that monitors at regular intervals, or by monitoring the overall operation of the service system according to a preset period; this embodiment does not specifically limit this. The preset period can be set according to actual requirements, for example one minute.
Whether the service system has a fault is judged by combining the operation indexes of the service system at a plurality of moments in the current time period. If the service system has no fault, the operation indexes of the service system at a plurality of moments in the next time period continue to be monitored, and the above judging method is repeated.
If the service system has faults, the components in the service system are subjected to layer-by-layer fault location, and the fault root cause example causing the faults of the service system is automatically located.
The second determining module 402 is configured to determine whether each service of the service system has a fault according to an operation index of each service in the current time period if the service system has the fault;
wherein the service of the business system is a code for implementing a certain function.
The operation index of each service is the operation index generated in the process that the service responds to other services or the operation index generated in the process that each service requests other services. Wherein the operation index of the service comprises the traffic volume, the success rate and the delay of the service.
As shown in fig. 2, the service system A includes a plurality of independent services, namely service 1, service 2, through service n, and each service is deployed as a plurality of distributed instances. A business request-and-response relationship exists between service 1 and service 2. The operation index generated while service 2 requests service 3 may be used as the operation index of service 2, or the operation index generated while service 2 responds to service 1 may be used as the operation index of service 2.
Here, sec1 is the process in which service 1 requests service 2, and SecX is the process in which service 2 returns its result to service 1 as a response.
And under the condition that the service system is determined to have faults, judging each service of the service system, and positioning the service causing the service system faults in the services of the service system according to the judgment result.
Optionally, the operation indexes of the services of the service system at multiple times in the current time period are obtained, and then the operation indexes at multiple times in the current time period are combined to judge whether each service has a fault.
The selection module 403 is configured to select a target service from all services of the service system according to a determination result of each service;
the judgment result of each service includes two conditions of existence of fault or nonexistence of fault.
When selecting a target service, determining the number of services with faults in a service system;
when a service with a fault exists in a service system, selecting the service with the fault from all the services of the service system as a target service; wherein the target service is a service causing a failure of the business system.
When there are a plurality of failed services in the service system, the plurality of failed services may be targeted, or one or more services may be selected from the plurality of failed services. And then, according to fault positioning of the instances under the target service, further acquiring the instances causing the target service fault, and further accurately positioning the source causing the service system fault.
The fault location module 404 is configured to determine whether each instance of the target service has a fault according to an operation index of each instance in the current time period, and obtain a fault location result of the service system according to a determination result of each instance.
Specifically, once the target service is obtained, each instance of the target service is judged, and the instance causing the target service fault is located among the instances of the target service according to the judgment result.
Optionally, the operation indexes of the instances at multiple times in the current time period are combined to determine whether the instances have faults.
When the instance in the target service has a fault, taking the instance with the fault as a fault positioning result of the business system;
when no fault exists in the instances of the target service, all the instances of the target service are used as the fault location result of the business system, namely, the root cause of the fault of the business system is the service level fault generated by the target service.
In this embodiment, fault detection is first performed on the business system in real time. When the business system has a fault, fault detection is performed on each service of the business system to determine the target service, and then on each instance of the target service, so as to locate the cause of the business system fault. On the one hand, instance-level fault detection is performed only on the instances of the target service, which greatly reduces the amount of calculation and effectively improves the locating efficiency; on the other hand, the cause of the fault is located layer by layer according to the hierarchical relationship, which makes the locating result more accurate.
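The layer-by-layer flow and its computational saving can be sketched as a single function. Everything named here is hypothetical: instance scores are supplied through a callable so that they are computed only for the one target service, and the selection rules and default thresholds follow the method embodiment described earlier.

```python
from typing import Callable, Dict, List, Optional


def drill_down(system_is_faulty: bool,
               service_scores: Dict[str, float],
               call_order: List[str],
               instance_scores: Callable[[str], Dict[str, float]],
               service_threshold: float = 0.8,
               instance_threshold: float = 0.8) -> Optional[dict]:
    """Locate the fault level by level: system, then service, then instance."""
    if not system_is_faulty:
        return None  # nothing to locate; keep monitoring the next period
    # Service level: most upstream faulty service, else the highest score.
    faulty_services = [s for s in call_order if service_scores[s] > service_threshold]
    if faulty_services:
        target = faulty_services[0]
    else:
        target = max(call_order, key=service_scores.get)
    # Instance level: scores are requested for the target service only.
    scores = instance_scores(target)
    faulty = sorted((i for i in scores if scores[i] > instance_threshold),
                    key=scores.get, reverse=True)
    return {"service": target, "instances": faulty or list(scores)}


requested = []


def fake_instance_scores(service):
    """Stand-in for the instance-level scoring step; records which services are scored."""
    requested.append(service)
    return {"order-1": 0.2, "order-2": 0.95, "order-3": 0.85}


result = drill_down(True,
                    {"gateway": 0.3, "order": 0.9, "payment": 0.95},
                    ["gateway", "order", "payment"],
                    fake_instance_scores)
```

In this run only the instances of "order" are scored, which is the source of the reduced calculation described above.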
On the basis of the foregoing embodiment, in this embodiment, the first determining module is specifically configured to: input the operation index of the service system in the current time period into an abnormality detection model, and output an abnormality score of the service system; compare the abnormality score with a first preset threshold; and judge whether the service system has a fault according to the comparison result. The abnormality detection model is obtained by training with the operation indexes of the business system in a historical period as samples and the abnormality scores corresponding to the samples as sample labels.
On the basis of the above embodiment, the embodiment further includes a calculating module specifically configured to: inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system; subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value; and taking the maximum value in the absolute value result as the abnormal score of the business system.
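The reconstruction-based score can be sketched as below. The abnormality detection model itself is not shown; its reconstructed output is passed in as a plain list. The function names are hypothetical, and treating a score above the first preset threshold as a fault is an assumption made by analogy with the second and third thresholds.

```python
def system_anomaly_score(original, reconstructed):
    """Maximum absolute difference between reconstructed and original indexes."""
    return max(abs(r - o) for o, r in zip(original, reconstructed))


def system_has_fault(original, reconstructed, first_threshold):
    """Assumed rule: a score above the first preset threshold indicates a fault."""
    return system_anomaly_score(original, reconstructed) > first_threshold


# Third index is reconstructed far from its observed value, so the score is high.
score = system_anomaly_score([0.9, 0.95, 0.2], [0.9, 0.9, 0.8])
```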
On the basis of the foregoing embodiments, the second determining module in this embodiment is specifically configured to: fusing the operation indexes of each service at a plurality of moments in the current time period, fusing the operation indexes of each service at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the service system at a plurality of moments in the current time period; subtracting the fusion results of the operation indexes of each service in the current time interval and the previous time interval of the current time interval, and calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each service and the similarity result corresponding to each service as the abnormal score of each service; and comparing the abnormal score of each service with a second preset threshold, and judging whether each service has a fault according to the comparison result.
On the basis of the above embodiment, the present embodiment further includes a fusion module specifically configured to: and calculating one or more of a statistical characteristic value, an exponential weighted moving average value and a wavelet decomposition value of each operation index of each service at a plurality of moments in the current time period.
On the basis of the above embodiment, the selection module in this embodiment is specifically configured to: if the service with the fault is one, taking the service with the fault as the target service; if the number of the services with faults is multiple, selecting the service executed first from the services with faults as the target service according to the execution sequence of the services with faults in the service system; and if all services have no fault, taking the service with the maximum abnormal score in the business system as the target service.
On the basis of the foregoing embodiments, the fault locating module in this embodiment is specifically configured to: fusing the operation indexes of each instance at a plurality of moments in the current time period, fusing the operation indexes of each instance at a plurality of moments in the previous time period of the current time period, and fusing the operation indexes of the target service at a plurality of moments in the current time period; subtracting the fusion results of the operation indexes of each instance in the current time period and the previous time period, and then calculating an absolute value; calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period; taking the sum of the maximum value in the absolute value result corresponding to each instance and the similarity result corresponding to each instance as the abnormal score of each instance; and comparing the abnormal score of each instance with a third preset threshold, and judging whether each instance has a fault according to the comparison result.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a business system fault location method comprising: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer readable storage medium, the computer program includes program instructions, when the program instructions are executed by a computer, the computer can execute the service system fault location method provided by the above methods, the method includes: judging whether the service system has a fault or not according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the service system fault location method provided above, the method including: judging whether the service system has a fault according to the operation index of the service system in the current time period; if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period; selecting a target service from all services of the service system according to the judgment result of each service; and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and can of course also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for locating a fault of a service system is characterized by comprising the following steps:
judging whether the service system has a fault according to the operation index of the service system in the current time period;
if the service system has a fault, judging whether each service has a fault according to the operation index of each service of the service system in the current time period;
selecting a target service from all services of the service system according to the judgment result of each service;
and judging whether each instance has a fault according to the operation index of each instance of the target service in the current time period, and acquiring a fault positioning result of the service system according to the judgment result of each instance.
2. The method of claim 1, wherein the determining whether the service system has a fault according to the operation index of the service system in the current time period comprises:
inputting the operation index of the service system in the current time period into an abnormality detection model, and outputting an abnormality score of the service system;
comparing the abnormality score with a first preset threshold;
judging whether the service system has a fault according to the comparison result;
the abnormal detection model takes the operation index of the business system in a historical period as a sample, and takes the abnormal score corresponding to the sample as a sample label for training and obtaining.
3. The method for locating faults of a business system according to claim 2, wherein the step of inputting the operation indexes of the business system in the current time period into an abnormality detection model and outputting the abnormality scores of the business system comprises the steps of:
inputting the operation index of the service system in the current time period into the abnormality detection model, and outputting the reconstructed operation index of the service system;
subtracting the operation indexes of the service system after reconstruction and before reconstruction, and calculating an absolute value;
and taking the maximum value in the absolute value result as the abnormal score of the business system.
4. The service system fault location method according to any one of claims 1 to 3, wherein determining whether each service has a fault according to the operation indexes of each service of the service system in the current time period comprises:
fusing the operation indexes of each service at a plurality of moments in the current time period, fusing the operation indexes of each service at a plurality of moments in the time period preceding the current time period, and fusing the operation indexes of the service system at a plurality of moments in the current time period;
subtracting, for each service, the fusion result of its operation indexes in the preceding time period from the fusion result of its operation indexes in the current time period, and taking the absolute value of each difference;
calculating the similarity between the fusion result of the operation indexes of each service in the current time period and the fusion result of the operation indexes of the service system in the current time period;
taking, for each service, the sum of the maximum of its absolute values and its similarity as the abnormality score of that service;
and comparing the abnormality score of each service with a second preset threshold, and determining whether each service has a fault according to the comparison result.
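The per-service score in claim 4 can be sketched as follows. Two choices here are assumptions not fixed by the claims: the fusion is a simple mean per operation index, and the similarity is cosine similarity. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def fuse(series_by_index: Sequence[Sequence[float]]) -> List[float]:
    """Fuse each operation index's samples into one value (assumed: the mean)."""
    return [sum(samples) / len(samples) for samples in series_by_index]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def service_score(current, previous, system_current) -> float:
    """Max absolute change between periods plus similarity to the system."""
    cur, prev, sys_cur = fuse(current), fuse(previous), fuse(system_current)
    max_change = max(abs(c - p) for c, p in zip(cur, prev))
    return max_change + cosine_similarity(cur, sys_cur)


# Two operation indexes, two moments per time period.
score = service_score(
    current=[[4.0, 4.0], [0.0, 0.0]],
    previous=[[1.0, 1.0], [0.0, 0.0]],
    system_current=[[2.0, 2.0], [0.0, 0.0]],
)
print(score)
```

The first term captures how sharply the service changed relative to the preceding period; the second captures how closely its current behaviour tracks that of the whole system.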
5. The method according to claim 4, wherein fusing the operation indexes of each service at a plurality of moments in the current time period comprises:
calculating one or more of a statistical characteristic value, an exponentially weighted moving average, and a wavelet decomposition value of each operation index of each service at the plurality of moments in the current time period.
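The three fusion options named in claim 5 can be illustrated with a short Python sketch. The particular statistics, the smoothing factor `alpha`, and the use of a single-level Haar wavelet are assumptions chosen for illustration; the claims do not specify them.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple


def statistical_features(values: Sequence[float]) -> Dict[str, float]:
    """A few common statistical characteristic values of one operation index."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def ewma(values: Sequence[float], alpha: float = 0.5) -> float:
    """Exponentially weighted moving average; later samples weigh more."""
    if not values:
        raise ValueError("values must not be empty")
    average = values[0]
    for value in values[1:]:
        average = alpha * value + (1.0 - alpha) * average
    return average


def haar_level1(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of Haar wavelet decomposition (even-length input only).
    Returns (approximation, detail) coefficients."""
    if len(values) % 2 != 0:
        raise ValueError("Haar decomposition needs an even number of samples")
    root2 = math.sqrt(2.0)
    pairs = list(zip(values[0::2], values[1::2]))
    approximation = [(a + b) / root2 for a, b in pairs]
    detail = [(a - b) / root2 for a, b in pairs]
    return approximation, detail


samples = [1.0, 1.0, 3.0, 5.0]
print(statistical_features(samples))
print(ewma(samples))
print(haar_level1(samples))
```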
6. The method according to claim 4, wherein selecting a target service from all services of the service system according to the determination result of each service comprises:
if exactly one service has a fault, taking that service as the target service;
if a plurality of services have faults, selecting, according to the execution order of the faulty services in the service system, the faulty service that is executed first as the target service;
and if no service has a fault, taking the service with the largest abnormality score in the service system as the target service.
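The three selection rules of claim 6 map directly onto a small function. The sketch below is illustrative only; the service names and the representation of the execution order as a list are assumptions.

```python
from typing import Dict, Iterable, Sequence


def select_target_service(
    scores: Dict[str, float],
    faulty: Iterable[str],
    execution_order: Sequence[str],
) -> str:
    """Pick the target service from the per-service determination results."""
    faulty_set = set(faulty)
    # Rule 1: exactly one faulty service.
    if len(faulty_set) == 1:
        return next(iter(faulty_set))
    # Rule 2: several faulty services, take the one executed first.
    if faulty_set:
        for name in execution_order:
            if name in faulty_set:
                return name
        raise ValueError("faulty services are missing from the execution order")
    # Rule 3: no faulty service, fall back to the highest abnormality score.
    return max(scores, key=scores.get)


scores = {"gateway": 0.2, "order": 0.9, "payment": 0.7}
order = ["gateway", "order", "payment"]
print(select_target_service(scores, {"order", "payment"}, order))
```

Choosing the earliest faulty service in the execution order reflects the idea that a fault upstream tends to propagate to the services executed after it.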
7. The service system fault location method according to any one of claims 1 to 3, wherein determining whether each instance has a fault according to the operation indexes of each instance of the target service in the current time period comprises:
fusing the operation indexes of each instance at a plurality of moments in the current time period, fusing the operation indexes of each instance at a plurality of moments in the time period preceding the current time period, and fusing the operation indexes of the target service at a plurality of moments in the current time period;
subtracting, for each instance, the fusion result of its operation indexes in the preceding time period from the fusion result of its operation indexes in the current time period, and taking the absolute value of each difference;
calculating the similarity between the fusion result of the operation indexes of each instance in the current time period and the fusion result of the operation indexes of the target service in the current time period;
taking, for each instance, the sum of the maximum of its absolute values and its similarity as the abnormality score of that instance;
and comparing the abnormality score of each instance with a third preset threshold, and determining whether each instance has a fault according to the comparison result.
8. A service system fault location device, comprising:
a first determination module, configured to determine whether the service system has a fault according to the operation indexes of the service system in the current time period;
a second determination module, configured to determine, if the service system has a fault, whether each service has a fault according to the operation indexes of each service of the service system in the current time period;
a selection module, configured to select a target service from all services of the service system according to the determination result of each service;
and a fault location module, configured to determine whether each instance has a fault according to the operation indexes of each instance of the target service in the current time period, and to obtain the fault location result of the service system according to the determination result of each instance.
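The way the four modules of claim 8 chain together can be sketched end to end. In this illustration the abnormality scores are assumed to be computed beforehand and passed in as plain numbers; the function name, the dictionary layout, and the "greater than threshold means faulty" rule are all assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def locate_fault(
    system_score: float,
    service_scores: Dict[str, float],
    instance_scores: Dict[str, Dict[str, float]],
    execution_order: Sequence[str],
    thresholds: Dict[str, float],
) -> Optional[Tuple[str, List[str]]]:
    """Return (target service, faulty instances), or None if the system is healthy."""
    # First determination module: system-level check.
    if system_score <= thresholds["system"]:
        return None
    # Second determination module: per-service check, kept in execution order.
    faulty = [
        name for name in execution_order
        if service_scores[name] > thresholds["service"]
    ]
    # Selection module: earliest faulty service, else the highest-scoring one.
    target = faulty[0] if faulty else max(service_scores, key=service_scores.get)
    # Fault location module: per-instance check within the target service.
    bad_instances = [
        instance for instance, score in instance_scores[target].items()
        if score > thresholds["instance"]
    ]
    return target, bad_instances


result = locate_fault(
    system_score=5.0,
    service_scores={"gateway": 0.5, "order": 2.0, "payment": 3.0},
    instance_scores={
        "gateway": {"gateway-0": 0.1},
        "order": {"order-0": 0.2, "order-1": 4.0},
        "payment": {"payment-0": 3.5},
    },
    execution_order=["gateway", "order", "payment"],
    thresholds={"system": 1.0, "service": 1.0, "instance": 1.0},
)
print(result)
```

Narrowing from system to service to instance means the per-instance indexes only need to be examined for one service rather than for every service in the system.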
9. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the service system fault location method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the service system fault location method according to any one of claims 1 to 7.
CN202110587616.3A 2021-05-27 2021-05-27 Service system fault positioning method and device Pending CN115408182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110587616.3A CN115408182A (en) 2021-05-27 2021-05-27 Service system fault positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110587616.3A CN115408182A (en) 2021-05-27 2021-05-27 Service system fault positioning method and device

Publications (1)

Publication Number Publication Date
CN115408182A true CN115408182A (en) 2022-11-29

Family

ID=84155639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110587616.3A Pending CN115408182A (en) 2021-05-27 2021-05-27 Service system fault positioning method and device

Country Status (1)

Country Link
CN (1) CN115408182A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431341A (en) * 2023-03-30 2023-07-14 浙江大学 Resource specification adjustment method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination