CN113037550B - Service fault monitoring method, system and computer readable storage medium - Google Patents

Service fault monitoring method, system and computer readable storage medium

Info

Publication number
CN113037550B
Authority
CN
China
Prior art keywords
service
host
interface
fault
peripheral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110241842.6A
Other languages
Chinese (zh)
Other versions
CN113037550A (en
Inventor
苏君福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Icsoc Beijing Communication Technology Co ltd
Original Assignee
Icsoc Beijing Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Icsoc Beijing Communication Technology Co ltd filed Critical Icsoc Beijing Communication Technology Co ltd
Priority to CN202110241842.6A priority Critical patent/CN113037550B/en
Publication of CN113037550A publication Critical patent/CN113037550A/en
Application granted granted Critical
Publication of CN113037550B publication Critical patent/CN113037550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis

Abstract

The invention provides a service fault monitoring method, a system and a computer readable storage medium. The service fault monitoring method comprises the following steps: acquiring service fault alarm information; acquiring association information between a host and its peripheral hosts; acquiring association information between the interfaces on the host and the interfaces on the peripheral hosts; performing cluster analysis on the hosts associated with the service and the interfaces associated with the service according to the service fault alarm information, the association information between the host and the peripheral hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts, to obtain a first analysis result that comprises the service function and the service range of the host; performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the service fault alarm information and the association information between those interfaces, to obtain a second analysis result; and determining a fault influence range according to the first analysis result and the second analysis result.

Description

Service fault monitoring method, system and computer readable storage medium
Technical Field
The invention belongs to the technical field of internet, and particularly relates to a service fault monitoring method, a system and a computer readable storage medium.
Background
In recent years, with the rapid development of internet technology, network service systems have grown in scale and the relationships between their internal modules have grown more complex, which makes service faults increasingly difficult to diagnose. In a large and complex cloud computing network environment, it is important to discover application faults in time, before they affect customers' use of the system. A fault monitoring method is therefore needed that diagnoses faults and limits losses promptly and effectively.
In the past, a problem was reported only after a customer could no longer use the system normally, and the relevant personnel then carried out troubleshooting.
However, this troubleshooting process is relatively complex and consumes considerable labor and time; some fault diagnoses take too long, so it is difficult to diagnose faults and limit losses promptly and effectively.
Disclosure of Invention
To solve the technical problems that troubleshooting is relatively complex and consumes considerable labor and time, that some fault diagnoses take too long, and that it is difficult to diagnose faults and limit losses promptly and effectively, the invention provides a service fault monitoring method, a service fault monitoring system and a computer readable storage medium.
The specific technical scheme of the invention is as follows:
The invention provides a service fault monitoring method, which comprises the following steps:
acquiring service fault alarm information;
acquiring association information between a host and its peripheral hosts;
acquiring association information between the interfaces on the host and the interfaces on the peripheral hosts;
performing cluster analysis on the hosts associated with the service and the interfaces associated with the service according to the service fault alarm information, the association information between the host and the peripheral hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts, to obtain a first analysis result, wherein the first analysis result comprises the service function and the service range of the host;
performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the service fault alarm information and the association information between those interfaces, to obtain a second analysis result, wherein services and interfaces on the host are in one-to-one correspondence;
and determining a fault influence range according to the first analysis result and the second analysis result.
In an optional embodiment, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts includes:
acquiring the weights between the interfaces on the host and the interfaces on the peripheral hosts;
sorting the weights in descending order;
and performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts in descending order of weight.
In an optional embodiment, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts in descending order of weight includes:
acquiring the service function and the service range of the failed service;
acquiring the service functions and the service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts;
and performing traversal analysis in descending order of weight according to the service function and the service range of the failed service and the service functions and the service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts.
In an optional embodiment, performing traversal analysis in descending order of weight according to the service function and the service range of the failed service and the service functions and the service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts includes:
comparing the service function and the service range of the failed service with the service functions and the service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts to obtain a comparison result;
acquiring the time series of the service fault according to the comparison result;
acquiring the time-series weights of the faults of the interfaces on the host and the interfaces on the peripheral hosts;
and performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the time series of the service fault and the time-series weights of the interface faults.
In an optional embodiment, acquiring the time series of the service fault according to the comparison result includes:
when the service function and the service range of the failed service are the same as those of the interfaces on the failed host and the interfaces on the peripheral hosts, acquiring the time series of the service fault.
In an optional embodiment, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the time series of the service fault and the time-series weights of the interface faults includes:
comparing the time series of the service fault with the time series of the faults of the interfaces on the host and the interfaces on the peripheral hosts to obtain a comparison result;
and obtaining a target fault interface according to the comparison result.
In an optional embodiment, obtaining a target fault interface according to the comparison result includes:
taking, as the target fault interface, an interface whose fault occurred before the service failure time and which has a pre-causal relationship with the service fault;
preferably, the interfaces whose faults occurred before the service failure time and which have a pre-causal relationship with the service fault are sorted by time-series weight in descending order, and the interface with the largest weight is taken as the target fault interface.
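The selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `InterfaceFault` record, its field names and the sample values are hypothetical, and because the source does not specify how the pre-causal relationship or the time-series weight is computed, both are taken as given inputs.

```python
from dataclasses import dataclass


@dataclass
class InterfaceFault:
    """Hypothetical record of one interface fault (field names are assumptions)."""
    interface_id: str
    failed_at: float     # time at which the interface fault occurred
    causal: bool         # True if pre-causally related to the service fault
    time_weight: float   # time-series weight; its computation is not specified


def target_fault_interface(faults, service_failed_at):
    """Return the ID of the highest-weight causal fault that precedes the service fault."""
    candidates = [
        f for f in faults
        if f.failed_at < service_failed_at and f.causal
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.time_weight).interface_id


faults = [
    InterfaceFault("a", failed_at=10, causal=True, time_weight=0.4),
    InterfaceFault("b", failed_at=12, causal=True, time_weight=0.9),
    InterfaceFault("c", failed_at=20, causal=True, time_weight=1.0),   # after the service fault
    InterfaceFault("d", failed_at=11, causal=False, time_weight=0.95), # not causally related
]
```

With a service fault at time 15, interfaces c and d are filtered out and interface b is selected because it has the larger weight of the two remaining candidates.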
In an optional embodiment, after determining the fault influence range according to the first analysis result and the second analysis result, the method further includes:
determining the fault influence type and the influence level according to the service function and the service range of the host and the service functions and the service ranges of the interfaces on the host and the interfaces on the peripheral hosts;
and generating a fault processing action according to the fault influence type and the influence level.
In another aspect, a service fault monitoring system is provided, the system comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement any of the methods described above.
In a further aspect, a computer readable storage medium is provided, which stores computer executable instructions that, when executed by a processor, implement the service fault monitoring method described in any one of the above.
The invention has the following beneficial effects: with the method provided by the embodiments of the application, the hosts and their associated interfaces do not need to be checked manually one by one. The server can analyze the association information between the hosts and the association information between the interfaces on the host and the interfaces on the peripheral hosts, and services and hosts are in one-to-one correspondence, so troubleshooting takes less time and is more efficient and more accurate.
Drawings
Fig. 1 is a diagram of the calling relationships between service modules provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a service fault monitoring method according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of an example of the interfaces called by a service at runtime, provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of monitoring after a service fault according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the fault processing speed and time after the method provided by the embodiment is adopted.
Detailed Description
The invention is explained in more detail below with reference to the figures and the following examples.
At present, the prior art diagnoses faults mainly by hand. Specifically, operation and maintenance personnel investigate the network service system according to a module call relationship diagram, an example of which is shown in Fig. 1. In most cases, a fault is discovered because many failed requests occur on the most upstream front-end module (module A in Fig. 1). The operation and maintenance personnel then work downstream from module A. Since module A calls module B, the indicators of module B need to be checked; if they are abnormal, module B is suspected of causing the fault. Module C, immediately downstream of module B, is then examined, and so on. In this process, suspicion is passed down along the call relationships between modules until it can be passed no further. In the example shown in Fig. 1, suspicion finally stops at module G. Real scenarios are more complicated: it is not enough to check whether a downstream module is abnormal, and the degree of abnormality must also be considered; the simplified description here is only for ease of understanding. For example, if module G is far less abnormal than module E, the root cause of the fault is more likely to lie in module E.
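The manual drill-down described above amounts to a walk over the call graph. The sketch below illustrates it under loudly stated assumptions: the call graph, the anomaly scores and the threshold are hypothetical values chosen for illustration (Fig. 1 is not reproduced here), and following the most abnormal downstream module is only one plausible way to model what an operator does.

```python
# Hypothetical call graph and anomaly scores; not taken from Fig. 1.
CALLS = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": [], "E": ["G"], "G": []}
ANOMALY = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.0, "E": 0.9, "G": 0.1}
THRESHOLD = 0.05  # assumed cut-off above which a module counts as abnormal


def suspect_chain(start):
    """Pass suspicion downstream until no abnormal downstream module remains."""
    chain, current = [start], start
    while True:
        abnormal = [m for m in CALLS[current] if ANOMALY[m] > THRESHOLD]
        if not abnormal:
            return chain
        # Follow the most abnormal downstream module.
        current = max(abnormal, key=ANOMALY.get)
        chain.append(current)


def likely_root(chain):
    """Weigh the degree of abnormality instead of blindly taking the last module."""
    downstream = chain[1:]
    return max(downstream, key=ANOMALY.get) if downstream else chain[0]


chain = suspect_chain("A")
```

In this made-up graph, suspicion stops at module G, but module E is reported as the more likely root cause because it is far more abnormal than G.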
Once the root-cause module has been determined, the root cause itself can be analyzed, so finding the root-cause module is an important step in fault diagnosis. Because large-scale services are deployed on thousands of servers, each with dozens to hundreds of service monitoring indicators, diagnosing faults through manual analysis and troubleshooting consumes a great deal of time and labor, and it is difficult to diagnose faults and limit losses promptly and effectively. To address these problems, embodiments of the present application provide a service fault monitoring method and apparatus.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a service fault monitoring method according to an embodiment of the present disclosure. The service fault monitoring method comprises the following steps:
S101, acquiring service fault alarm information.
S102, acquiring association information between the host and the peripheral hosts.
S103, acquiring association information between the interfaces on the host and the interfaces on the peripheral hosts.
S104, performing cluster analysis on the hosts associated with the service and the interfaces associated with the service according to the service fault alarm information, the association information between the host and the peripheral hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts, to obtain a first analysis result, wherein the first analysis result comprises the service function and the service range of the host.
S105, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the service fault alarm information and the association information between those interfaces, to obtain a second analysis result, wherein services and interfaces on the host are in one-to-one correspondence.
S106, determining a fault influence range according to the first analysis result and the second analysis result.
According to the method provided by the embodiment of the application, the host is associated with the peripheral hosts to obtain the association information between them, and the interfaces on the host are associated with the interfaces on the peripheral hosts to obtain the association information between those interfaces. When a service raises an alarm, cluster analysis is performed on the hosts and interfaces associated with the service to obtain a first analysis result, which comprises the service function and the service range of the host. Traversal analysis is performed on the interfaces on the host and the interfaces on the peripheral hosts to obtain a second analysis result, which comprises the service function and the service range of the service interface. The fault influence range is then determined according to the first analysis result and the second analysis result. With this method, the hosts and their associated interfaces do not need to be checked manually one by one: the server can analyze the association information between the hosts and between the interfaces, and services and hosts are in one-to-one correspondence, so troubleshooting takes less time and is more efficient and more accurate.
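To make the data flow of S101 to S106 concrete, the following is a deliberately simplified sketch. Everything in it is an assumption made for illustration: the `Node` record, the `monitor` function, the dictionary layouts, and in particular the choice to model the influence range as the union of the (service function, service range) pairs from both analysis results. The source does not specify how the two results are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A host or an interface with its service function and service range."""
    node_id: str
    function: str
    service_range: str


def monitor(alarm_host, host_links, iface_links, hosts, ifaces):
    """Return an (assumed) fault influence range for an alarm on alarm_host."""
    # S102: hosts associated with the alarming host.
    related_hosts = [alarm_host] + host_links.get(alarm_host, [])
    # S104: first analysis result, taken from the related hosts.
    first = {(hosts[h].function, hosts[h].service_range) for h in related_hosts}
    # S103 + S105: second analysis result, taken from interfaces on those hosts.
    second = set()
    for host_id in related_hosts:
        for iface_id in iface_links.get(host_id, []):
            iface = ifaces[iface_id]
            second.add((iface.function, iface.service_range))
    # S106: combine both results (union is an assumption of this sketch).
    return first | second


hosts = {
    "01": Node("01", "login", "user login"),
    "02": Node("02", "statistics", "usage statistics"),
}
ifaces = {
    "a": Node("a", "login", "user login"),
    "d": Node("d", "statistics", "usage statistics"),
}
host_links = {"02": ["01"]}              # host 02 is associated with host 01
iface_links = {"01": ["a"], "02": ["d"]}  # interfaces located on each host
influence = monitor("02", host_links, iface_links, hosts, ifaces)
```

Here an alarm on host 02 yields an influence range that covers both the statistics function and the login function it depends on.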
The method provided by the embodiments of the present application will be further described below by way of alternative embodiments.
It should be noted that the method provided by the embodiment of the present application may be executed by a server, and the server may be installed on a terminal such as a mobile phone, a tablet computer or an upper computer, which is not limited in the embodiment of the present application.
S101, acquiring information of service fault alarm.
When a service fails, it can raise an alarm, and the server can then obtain the alarm information. The alarm may take the form of a prompt displayed on a terminal screen, a sound emitted by an alarm device, or the like; the embodiment of the present application does not limit the form of the alarm, as long as the server can obtain it.
S102, acquiring the associated information between the host and the peripheral host.
It can be understood that a service is connected to one host, but when the service runs it may need to call peripheral hosts to complete tasks or to use their data. In that case, the hosts associated with the host must be looked up through the association information in order to analyze and determine which host has failed. In an optional embodiment, the association information between the host and the peripheral hosts may include: the association relationship between the host and the peripheral hosts, for example, if host A calls host B and host B calls host C, the association relationship A -> B -> C is generated between hosts A, B and C; the identity information of the hosts, such as the IDs of hosts A, B and C; and the service function, service range, service parameters, etc. of each host.
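As an illustration of the A -> B -> C example, the sketch below stores the call relationships in a plain dictionary and lists the hosts reachable from a given host. The names `HOST_CALLS` and `associated_hosts` and the dictionary layout are assumptions for this example; the source does not prescribe a storage format.

```python
# Assumed layout: host ID -> list of host IDs it calls.
HOST_CALLS = {"A": ["B"], "B": ["C"], "C": []}


def associated_hosts(host_id, calls):
    """Return the hosts reachable from host_id along the call chain, nearest first."""
    seen = {host_id}
    order = []
    stack = list(reversed(calls.get(host_id, [])))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the call relationships
        seen.add(current)
        order.append(current)
        stack.extend(reversed(calls.get(current, [])))
    return order
```

For host A this returns B and then C, which is the set of hosts that would need to be examined when a service connected to host A raises an alarm.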
S103, acquiring the associated information of the host machine upper interface and the peripheral host machine upper interface.
It can be understood that when a service raises a fault alarm, the cause may not be a fault of the host connected to the service or of an interface on that host, but a fault of an associated peripheral host or of an interface on it. Acquiring the association information between the interfaces on the host and the interfaces on the peripheral hosts therefore makes it easier to find the cause of the service alarm.
As an example, in the embodiment of the present application there are two service hosts, numbered 01 and 02. Host 01 provides a login service and host 02 provides a statistics service. When a user uses the system, the user logs in through host 01, and the data is transmitted to host 02 for statistics. Host 01 has three login interfaces a, b and c, and host 02 has one statistics interface d. When host 02 raises an alarm, not only interface d but also interfaces a, b and c on host 01, which are associated with interface d, need to be checked. For example, if checking shows that the data transmitted from login interface a to statistics interface d has no problem, login interface a is normal, and the remaining login interfaces are checked in turn until the faulty interface is located.
Further, each interface on a host provided in the embodiment of the present application has a unique ID. During checking, interfaces that have already been checked can be marked by their IDs so that they are not checked again, and the fault location can be found accurately and quickly from the interface IDs.
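The host 01 / host 02 example, together with the ID-based marking, can be sketched as follows. The `ASSOCIATED` mapping, the `find_faulty` function and the `is_healthy` callback are hypothetical names introduced for this illustration; how an individual interface is actually checked is outside the scope of the sketch.

```python
from collections import deque

# Assumed association data: interface d on host 02 is associated with
# interfaces a, b and c on host 01 (a also points back to d).
ASSOCIATED = {"d": ["a", "b", "c"], "a": ["d"]}


def find_faulty(start, associated, is_healthy):
    """Check interfaces outward from start, marking checked IDs to avoid repeats."""
    checked = set()
    queue = deque([start])
    while queue:
        iface_id = queue.popleft()
        if iface_id in checked:
            continue  # already marked as checked, skip it
        checked.add(iface_id)
        if not is_healthy(iface_id):
            return iface_id, checked
        queue.extend(associated.get(iface_id, []))
    return None, checked


# Pretend that login interface b is the broken one.
faulty, checked = find_faulty("d", ASSOCIATED, lambda iface_id: iface_id != "b")
```

Interface d is checked first, then a, then b, at which point the search stops; interface d is not checked a second time even though a refers back to it.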
In an optional embodiment, probe code is embedded in each host interface in an APM breakpoint manner. When a service calls a host and an interface, the APM probe can capture and collect the association relationships between hosts and between an interface on the host and the peripheral interfaces it calls. When the service raises an alarm, the association relationships between the interface on the host and the peripheral interfaces can be retrieved, and the fault location can be found through them.
It should be noted that the interfaces on the host and on the peripheral hosts provided in the embodiments of the present application include newly developed interfaces and existing interfaces. An existing interface cannot be modified, so the relationships between such interfaces are obtained through their network connections. For a newly developed interface, APM probe code can be embedded when the interface is developed; when the interface is called, the APM probe obtains the association relationships between interfaces and transmits them to the server of the monitoring system when needed.
With the APM approach, each service corresponds to one host and to one interface on that host; that is, services and hosts, and services and interfaces on the host, are in one-to-one correspondence. When a service fails, the fault location can be found quickly, which improves monitoring efficiency.
It can be understood that, in this embodiment, the association information includes the association relationships between the interfaces on the host and the interfaces on the peripheral hosts, together with the identity information of those interfaces, so that when a service alarm is raised the fault location can be found quickly from the association relationships and the identity information, improving monitoring efficiency.
In an optional embodiment, the association information between the interfaces on the host and the interfaces on the peripheral hosts includes distance weight information and time weight information between those interfaces.
A weight expresses the importance of a factor or indicator relative to an event. Unlike a simple proportion, it reflects not only the share that the factor or indicator accounts for but also its relative importance, with the emphasis on its degree of contribution. In the embodiment of the application, the distance weights of the hosts and interfaces associated with the service are acquired, and the clustered hosts and interfaces associated with the service are sorted by those distance weights. After a service alarm, the hosts and interfaces associated with the service are searched in distance-weight order, which is more efficient than searching them in no particular order.
Referring to Fig. 3, Fig. 3 shows an example of the interfaces called by a service at runtime according to an embodiment of the present application. As an example, although service A is connected to an interface on one host, service A may call service B when used, service B in turn calls service C, and service C in turn calls service D. In this process, the distance between service A and service C is greater than the distance between service A and service B, because service A reaches service C through service B; accordingly, the distance weight of service B relative to service A is greater than that of service C relative to service A.
Similarly, each interface has a different service function and service range, and these differ in importance to the service. Searching the interfaces in order of weight is more efficient than searching them in no particular order.
As an example, although service A is connected to an interface on one host, service A may call interface b through interface a when used, interface b calls interface c, and interface c calls interface d. In this process, the distance between interface a and interface c is greater than the distance between interface a and interface b, because interface a reaches interface c through interface b; accordingly, the distance weight of interface b relative to interface a is greater than that of interface c relative to interface a.
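One way to obtain such distance weights is to count call hops with a breadth-first search and give nearer interfaces larger weights. The source only states that an interface reached through fewer calls has the larger weight; taking the weight as the reciprocal of the hop count, as below, is an assumption of this sketch, as are the names `INTERFACE_CALLS` and `distance_weights`.

```python
from collections import deque

# Call chain from the example: a -> b -> c -> d.
INTERFACE_CALLS = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}


def distance_weights(start, calls):
    """Map each reachable interface to 1 / (number of call hops from start)."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee not in hops:
                hops[callee] = hops[current] + 1
                queue.append(callee)
    return {iface: 1 / dist for iface, dist in hops.items() if dist > 0}


weights = distance_weights("a", INTERFACE_CALLS)
search_order = sorted(weights, key=weights.get, reverse=True)
```

Sorting by weight in descending order gives the search order b, c, d, so the interface closest to a is examined first.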
It should be noted that each service has a time series during operation, which stores the operating state and attributes of the service at each point in time. When a service raises a fault alarm, the alarm is also recorded in the service's time series, so the alarm time can be read from it. It can be understood that every alarm has a cause: the fault of each interface is likewise reflected in its time series, from which the cause of the service fault can be derived. Therefore, including the time weight information between the interfaces on the host and the interfaces on the peripheral hosts in the association information improves the efficiency of fault finding.
In an optional embodiment, when the monitoring method is implemented, a function module dictionary library for hosts and service interfaces may be defined; the hosts and service interfaces are linked together and their association information is stored in the dictionary library for fast traversal analysis.
An influence range dictionary library for host and service interface modules is also defined, in which each host and each service interface has a corresponding service function module description and influence range.
In the method provided by the embodiment of the application, while the monitored system is in use it continuously updates the association information stored in the host and service interface module influence range dictionary library, both between the host and the peripheral hosts and between the interfaces on the host and the interfaces on the peripheral hosts, and each round of service monitoring works from the updated data. The monitoring system and method therefore always use the latest data when monitoring service faults, which improves monitoring efficiency.
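The two dictionary libraries can be pictured as two in-memory mappings that are updated as the system runs. The class below is a minimal sketch; the class name, method names and field names are assumptions made for illustration, and a real deployment would presumably persist this data rather than hold it in process memory.

```python
class DictionaryLibrary:
    """Holds association information and per-node influence descriptions."""

    def __init__(self):
        self.associations = {}  # node ID -> set of associated node IDs
        self.influence = {}     # node ID -> function module description and range

    def link(self, source_id, target_id):
        """Record that source_id is associated with (calls) target_id."""
        self.associations.setdefault(source_id, set()).add(target_id)

    def describe(self, node_id, module, influence_range):
        """Create or update the description of a host or service interface."""
        self.influence[node_id] = {"module": module, "range": influence_range}

    def lookup(self, node_id):
        """Return the latest description and the sorted associated node IDs."""
        return (
            self.influence.get(node_id),
            sorted(self.associations.get(node_id, ())),
        )


library = DictionaryLibrary()
library.link("host01", "host02")
library.describe("host01", "login", "user login")
# A later update replaces the earlier description, so lookups see fresh data.
library.describe("host01", "login", "login and registration")
```

Because `describe` overwrites the previous entry, each monitoring round that calls `lookup` works from the most recently recorded data.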
S104, performing cluster analysis on the hosts associated with the service and the interfaces associated with the service according to the service fault alarm information, the association information between the host and the peripheral hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts, to obtain a first analysis result, wherein the first analysis result comprises the service function and the service range of the host.
After the service raises an alarm, the server performs cluster analysis on the hosts and interfaces associated with the service according to the alarm information, the acquired association information between the host and the peripheral hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts. It can be understood that one service is connected to one interface on one host, so each interface on a host has a different service range and function. In the embodiments of the present application, cluster analysis of the hosts and interfaces associated with the service means, for example, aggregating and classifying the hosts and host interfaces that have the same service range or the same service function; checking the classified hosts and interfaces then improves monitoring efficiency.
The first analysis result comprises the service function and the service range of the host. As mentioned above, each host has a different service function and service range, one service is connected to one interface on one host, and each interface has a different function and range. The fault location is checked according to the service range and the service function of the host.
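The aggregation described above, in which hosts and interfaces with the same service function or the same service range are grouped together, can be sketched as a simple grouping by key. This is plain grouping rather than a statistical clustering algorithm; the source does not name a specific algorithm, and the `cluster` function and the record layout are assumptions of this example.

```python
from collections import defaultdict

# Hosts 01/02 and interfaces a-d from the earlier login/statistics example.
NODES = [
    {"id": "host01", "function": "login", "range": "user login"},
    {"id": "host02", "function": "statistics", "range": "usage statistics"},
    {"id": "a", "function": "login", "range": "user login"},
    {"id": "b", "function": "login", "range": "user login"},
    {"id": "c", "function": "login", "range": "user login"},
    {"id": "d", "function": "statistics", "range": "usage statistics"},
]


def cluster(nodes, key):
    """Group node IDs by the given attribute ("function" or "range")."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node[key]].append(node["id"])
    return dict(groups)


by_function = cluster(NODES, "function")
```

When the statistics service raises an alarm, only the group keyed by the statistics function needs to be checked first, instead of every host and interface.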
S105, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the service fault alarm information and the association information between those interfaces, to obtain a second analysis result, wherein services and interfaces on the host are in one-to-one correspondence.
Traversal means visiting each node in a tree (or graph) once, in turn, along a search route. The operation performed when a node is visited depends on the specific application; it may be checking the value of the node, updating the value of the node, and so on. In the embodiment of the present application, traversal analysis means visiting once each interface on the host and each interface on the peripheral hosts that is related to the service.
In an optional embodiment, in S105, performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts in descending order of weight includes:
acquiring the service function and the service range of the failed service.
As mentioned above, each interface corresponds to a specific service function and service range. As an example, WeChat has a variety of service functions, such as a friend service, a payment service, a location service, and the like; the service range of the friend service includes classifying friends, sorting the classified friends, and so on. By acquiring the service function and service range of the failed service, the interfaces within the corresponding service function and service range can be checked in a targeted manner, which reduces unnecessary workload and improves working efficiency.
Acquiring the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts.
Whether the host has failed can be determined from the service alarm and the association information between the host and the peripheral hosts. The service functions and service ranges of the interfaces on the host and on the peripheral hosts are then acquired, so that interfaces whose service function and service range differ from those of the failed service are excluded. This avoids checking interfaces that have no association with the fault and improves monitoring efficiency.
Performing traversal analysis, in descending order of weight, according to the service function and service range of the failed service and the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts.
It can be understood that the interfaces on the failed host and on the peripheral hosts differ in how significant their service functions and service ranges are to the failed service; that is, the probability that a given interface caused the service failure differs from interface to interface. The interfaces are therefore sorted by weight in descending order and traversed in that order, which improves the efficiency of the traversal analysis.
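The filtering and weight-ordered traversal described above can be sketched as follows. This is only an illustration under stated assumptions: the `Interface` record, its field names and the numeric weights are hypothetical, and the patent does not specify how the weights are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    name: str
    function: str       # service function, e.g. "friend"
    service_range: str  # service range, e.g. "friend_sorting"
    weight: float       # assumed likelihood of having caused the fault


def candidates_by_weight(failed_function, failed_range, interfaces):
    """Drop unrelated interfaces, then order the rest by descending weight."""
    related = [
        i for i in interfaces
        if i.function == failed_function and i.service_range == failed_range
    ]
    return sorted(related, key=lambda i: i.weight, reverse=True)


interfaces = [
    Interface("if_a", "friend", "friend_sorting", 0.2),
    Interface("if_b", "payment", "transfer", 0.9),
    Interface("if_c", "friend", "friend_sorting", 0.7),
]

ordered = candidates_by_weight("friend", "friend_sorting", interfaces)
names = [i.name for i in ordered]
```

Here `if_b` is excluded despite its high weight because its service function and range do not match the failed service, so only `if_c` and `if_a` are traversed, in that order.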
In an optional embodiment, performing traversal analysis, in descending order of weight, according to the service function and service range of the failed service and the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts includes:
comparing the service function and service range of the failed service with the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts to obtain a comparison result.
It can be understood that when the service function and service range of the failed service differ from those of an interface on the failed host or on a peripheral host, the two are unrelated and the interface need not be analyzed. The service functions and service ranges are therefore compared first, and traversal analysis is performed only on interfaces that are related, which reduces unnecessary workload.
Acquiring the time series of the service failure according to the comparison result.
Acquiring the time series of failures of the interfaces on the host and the interfaces on the peripheral hosts.
Performing traversal analysis according to the time series of the service failure and the time series of failures of the interfaces on the host and the interfaces on the peripheral hosts.
When the service function and service range of the failed service are the same as those of the interfaces on the failed host and the interfaces on the peripheral hosts, the time series of the service failure is acquired. As an example, the server finds through traversal analysis that the WeChat friend service failed at 12:05 and that the payment function failed at 15:05. The time series of failures of the interfaces on the host and on the peripheral hosts is 11:45, 11:59, 12:04, 12:50, 15:00, 15:04, and so on. Traversal analysis is then performed according to the acquired time series of the service failure and the time series of the interface failures.
In an optional embodiment, acquiring the time series of the service failure according to the comparison result includes:
when the service function and service range of the failed service are the same as those of the interfaces on the failed host and the interfaces on the peripheral hosts, acquiring the time series of the service failure.
In an optional embodiment, performing traversal analysis according to the time series of the service failure and the time series of failures of the interfaces on the host and the interfaces on the peripheral hosts includes:
comparing the time series of the service failure with the time series of failures of the interfaces on the host and the interfaces on the peripheral hosts to obtain a comparison result;
and obtaining a target failure interface according to the comparison result.
In an optional embodiment, obtaining the target failure interface according to the comparison result includes:
taking an interface that failed before the service failure time and has a pre-causal relationship with the service failure as the target failure interface.
It can be understood that before a service fails, an interface connected to the service must already have failed. A comparison result is therefore obtained by acquiring the time series of failures of the interfaces on the host and on the peripheral hosts and comparing it with the time series of the service failure. As an example, the comparison result may fall into three cases. First: the interface on the host or on a peripheral host failed before the service failed. Second: the interface failed after the service failed. Third: the interface failed at the same time as the service. Based on pre-causality, an interface can have caused the service failure only if it failed before the service did. Therefore, an interface that failed before the service failure time and has a pre-causal relationship with the service failure is taken as the target failure interface.
In an optional embodiment, taking an interface that failed before the service failure time and has a pre-causal relationship with the service failure as the target failure interface includes: sorting such interfaces by time-series weight in descending order, and taking the interface with the largest weight as the target failure interface.
As an example, the server finds through traversal analysis that the WeChat friend service failed at 12:05. The failure times of the interfaces on the host and on the peripheral hosts are a: 11:45, b: 11:59, c: 12:04, d: 12:50, e: 15:00, f: 15:04; that is, each time point corresponds to one interface. Clearly, three interfaces, a, b and c, failed before 12:05, but the one closest to 12:05 is interface c, at 12:04, so interface c is taken as the target failure interface.
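The worked example above can be reproduced with a short Python sketch. It assumes that the time-series weight simply favors the interface failure closest in time before the service failure, which matches the example but is not a formula given in the patent; the function name is hypothetical, and the times are read as hours and minutes.

```python
from datetime import time


def target_failure_interface(service_failure_time, interface_failure_times):
    """Return the interface whose failure most closely precedes the service failure.

    Interfaces that failed at the same time as, or after, the service are
    ignored, because only an earlier failure can be a cause.
    """
    earlier = {
        name: failed_at
        for name, failed_at in interface_failure_times.items()
        if failed_at < service_failure_time
    }
    if not earlier:
        return None
    # The latest of the earlier failures is the closest to the service failure.
    return max(earlier, key=earlier.get)


interface_failure_times = {
    "a": time(11, 45), "b": time(11, 59), "c": time(12, 4),
    "d": time(12, 50), "e": time(15, 0), "f": time(15, 4),
}

friend_target = target_failure_interface(time(12, 5), interface_failure_times)
payment_target = target_failure_interface(time(15, 5), interface_failure_times)
```

For the friend service failure at 12:05 the sketch selects interface c; for a failure at 15:05 it would select interface f by the same rule.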
S106: Determine the fault influence range according to the first analysis result and the second analysis result.
In an optional embodiment, after determining the fault influence range according to the first analysis result and the second analysis result, the method further includes: generating a fault processing action according to the fault influence range.
According to the embodiment of the application, processing actions corresponding to various faults can be stored in advance, and after the influence range is determined, the corresponding processing action is selected according to the influence range.
In an optional embodiment, generating the fault processing action according to the fault influence range includes:
determining the fault influence type and the influence level according to the service function and service range of the host and the service function and service range of the interface;
and generating a fault processing action according to the fault influence type and the influence level.
As an example, the object monitored in the embodiment of the application is the WeChat program. When a user logs in to WeChat and finds that the friend list is displayed incompletely, the problem lies within the range of the friend service rather than in services such as payment or messaging, so the fault location can be searched for directly within the range of friend display. Further, it can be determined whether all of the friend display or only part of it is affected; if only part is affected, only that part of the friend service needs to be checked when generating the processing action.
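A rough sketch of how an influence type and level might be derived from service functions and ranges is given below. Everything in it is an assumption made for illustration: the `classify_impact` helper, the range names, and the three levels ("none", "partial", "full") are not defined by the patent, which only distinguishes whether all or part of a service is affected.

```python
def classify_impact(service_function, known_ranges, affected_ranges):
    """Classify a fault by the service function it hits and how much of it is hit."""
    known = set(known_ranges)
    affected = known & set(affected_ranges)
    if not affected:
        level = "none"
    elif affected == known:
        level = "full"
    else:
        level = "partial"
    # Only the affected ranges need to be checked when choosing an action.
    return {"type": service_function, "level": level, "check": sorted(affected)}


# Hypothetical service ranges of the friend service.
FRIEND_RANGES = ["friend_classification", "friend_sorting"]

partial = classify_impact("friend", FRIEND_RANGES, ["friend_sorting"])
full = classify_impact("friend", FRIEND_RANGES, FRIEND_RANGES)
```

In the partial case only `friend_sorting` is returned for checking, which mirrors the point that payment, messaging and the unaffected part of the friend service can be skipped.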
In an optional embodiment, before determining the fault influence range according to the first analysis result and the second analysis result, the method further includes:
obtaining historical data of fault influence ranges, comparing the current fault influence range data with the historical data to obtain a first comparison result, and determining a target fault influence range according to the first comparison result.
It can be understood that the number of services is large, and so is the number of hosts and interfaces related to each service. By comparing the current fault influence range data with the historical data, when the current data is the same as or similar to the historical data, the obtained fault influence range data can be regarded as accurate, and the processing action can be generated from it.
In an optional embodiment, the historical data of the fault influence range includes the time points at which the fault occurred and a record of the fault points. According to the three-four-three rule of fault handling, a fault is divided into three time periods, namely the fault discovery time, the fault response time and the fault processing time; a fault has four time points, namely the time the fault occurred, the time it was discovered, the time processing started and the time of recovery; and three things need to be done to handle a fault, namely decision making, recovery and notification. By recording the historical data of faults, the occurrence and handling of past faults can be consulted the next time a fault occurs.
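One possible way to record the four time points and derive the three time periods is sketched below. The `IncidentRecord` class and its field names are hypothetical, and the mapping of each period to a pair of consecutive time points is an assumption based on the natural reading of the rule, not something the patent states explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    # The four time points of a fault.
    occurred_at: datetime
    discovered_at: datetime
    handling_started_at: datetime
    recovered_at: datetime

    # The three time periods, assumed to lie between consecutive time points.
    @property
    def discovery_time(self) -> timedelta:
        return self.discovered_at - self.occurred_at

    @property
    def response_time(self) -> timedelta:
        return self.handling_started_at - self.discovered_at

    @property
    def processing_time(self) -> timedelta:
        return self.recovered_at - self.handling_started_at


record = IncidentRecord(
    occurred_at=datetime(2021, 3, 4, 12, 4),
    discovered_at=datetime(2021, 3, 4, 12, 5),
    handling_started_at=datetime(2021, 3, 4, 12, 15),
    recovered_at=datetime(2021, 3, 4, 12, 45),
)
```

Storing only the four time points and computing the periods on demand keeps the record free of redundant fields that could drift out of sync.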
In an optional embodiment, generating the fault processing action according to the fault influence range includes:
acquiring the historical processing actions generated for the fault influence range, comparing the current processing action with the historical processing actions in a preset manner to obtain a second comparison result, and generating the current processing action according to the second comparison result.
The embodiment of the application can also store historical data of fault influence ranges and compare the current fault influence range with that historical data. When the second comparison result is that the current fault influence range is the same as a historical fault influence range, the same processing action as for the historical fault can be adopted. When the second comparison result is that the current fault influence range differs from the historical data, a processing action similar to that of the historical fault can be adopted, which improves fault handling efficiency.
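The comparison against historical influence ranges could look like the following sketch. It assumes that an influence range is represented as a set of affected items and that similarity is measured with the Jaccard index; both choices, along with the `choose_action` helper and the sample actions, are illustrative and are not taken from the patent.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def choose_action(current_range, history):
    """Pick a processing action from `history`, a list of (range, action) pairs.

    Returns ("reuse", action) for an identical historical range,
    ("adapt", action) for the most similar overlapping range,
    or None when nothing in the history overlaps.
    """
    if not history:
        return None
    best_range, best_action = max(
        history, key=lambda item: jaccard(current_range, item[0])
    )
    score = jaccard(current_range, best_range)
    if score == 1.0:
        return ("reuse", best_action)
    if score > 0:
        return ("adapt", best_action)
    return None


history = [
    ({"friend/list", "friend/sort"}, "restart friend service"),
    ({"payment/transfer"}, "fail over payment gateway"),
]

same = choose_action({"friend/list", "friend/sort"}, history)
similar = choose_action({"friend/list"}, history)
unknown = choose_action({"location/share"}, history)
```

Returning `None` for an unseen range leaves room for an engineer to decide, rather than forcing an unrelated historical action onto a new kind of fault.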
In an optional embodiment, when the monitoring method is implemented, an accident handling action module may be written, which captures the host interface services related to an accident through the traversal analysis associated with the accident.
An accident notification interface module is written; the service information of the host interfaces captured by the traversal is pushed to the accident notification interface module, and the accident handling engineer is notified.
An accident recording module is written, which records the three time periods, four time points and three tasks of the accident to the accident management platform.
In an optional embodiment, the method further comprises: acquiring the association information between hosts, and between the interfaces on the host and the interfaces on the peripheral hosts, in the form of an undirected graph.
An undirected graph is a graph whose edges have no direction. The embodiment of the application uses an undirected graph to represent the association information between hosts and between the interfaces on the host and the interfaces on the peripheral hosts, so that the associations between hosts and between interfaces are expressed clearly and concisely.
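A minimal sketch of such an undirected graph, stored as an adjacency map, is shown below. The `UndirectedGraph` class and the host names are hypothetical; the patent does not specify a storage format.

```python
from collections import defaultdict


class UndirectedGraph:
    """Adjacency-set graph in which every edge is stored in both directions."""

    def __init__(self):
        self._adjacent = defaultdict(set)

    def add_edge(self, u, v):
        # An undirected association: u is related to v and v is related to u.
        self._adjacent[u].add(v)
        self._adjacent[v].add(u)

    def neighbors(self, node):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._adjacent.get(node, ()))


# Hypothetical association information between a host and its peripheral hosts.
hosts = UndirectedGraph()
hosts.add_edge("host1", "host2")
hosts.add_edge("host1", "host3")
```

Because each association is stored symmetrically, the peripheral hosts of any host can be looked up directly, regardless of which side the association was recorded from. The same structure could hold the associations between interfaces.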
Fig. 4 is a schematic view illustrating the monitoring process after a service failure occurs according to an embodiment of the present disclosure.
As can be seen from Fig. 4, after a service raises an alarm, the hosts associated with the failed service and the interfaces associated with the failed service may be aggregated, and service analysis, that is, the service function and service range analysis of the embodiment of the present application, is then performed on the associated hosts and associated interfaces. Because an APM probe is embedded in each newly developed interface, APM analysis is performed on those interfaces; that is, the association between a newly developed interface and the service is obtained through the APM probe, and the interface is analyzed. For old interfaces, the association information between the interfaces on the host and the interfaces on the peripheral hosts is acquired in network form. After the associated-host service analysis, the associated-interface service analysis and the APM interface association analysis, the fault influence range is determined, the fault level is determined according to the fault influence range, and a fault processing action is generated. The on-duty engineer and the core department are then notified of the fault influence level and the fault processing action. The on-duty engineer or the core department creates an accident recording module according to the fault response level and the fault processing action, records the historical fault data, and handles the accident according to the processing action.
With the method provided by this embodiment, the speed of fault handling after a service is monitored is greatly improved and the time it takes is greatly reduced. Fig. 5 is a schematic diagram of the fault processing speed and time after the method provided by the embodiment of the present application is adopted. In Fig. 5, the abscissa is the ID name of each service fault, and the ordinate indicates the time taken to process each accident; the fault ID naming convention is time + product line name, which makes each fault ID unique. As an example, the abscissa runs from February 2019 to August 2019, and the time taken to process faults shows a downward trend, which indicates that handling service faults with the method provided by the embodiment of the present application becomes increasingly effective.
In another aspect, a service failure monitoring system is provided, the system comprising: memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement any of the methods described above.
In yet another aspect, a computer-readable storage medium is provided, in which computer-executable instructions are stored, and when executed by a processor, the computer-executable instructions are used for implementing any one of the above-mentioned service failure monitoring methods.
The above examples only show some embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A service failure monitoring method, characterized by comprising the following steps:
acquiring service fault alarm information;
acquiring the association information between the host and the peripheral hosts;
acquiring the association information between the interfaces on the host and the interfaces on the peripheral hosts;
according to the service fault alarm information, the association information between the hosts, and the association information between the interfaces on the host and the interfaces on the peripheral hosts, aggregating the hosts associated with the service and the interfaces associated with the service, and then performing service analysis on the aggregated hosts associated with the service and the aggregated interfaces associated with the service to obtain a first analysis result, wherein the first analysis result comprises a host service function and a service range;
performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the service fault alarm information and the association information between the interfaces on the host and the interfaces on the peripheral hosts to obtain a second analysis result, wherein the second analysis result comprises the service function and the service range of the service interface, a target failure interface is obtained in the traversal analysis, and an interface that failed before the service failure time and has a pre-causal relationship with the service failure is taken as the target failure interface; the services are in one-to-one correspondence with the interfaces on the host;
defining a dictionary base of influence ranges of host and service interface modules, wherein each host and each service interface respectively corresponds to a description of a service function module and its influence range;
and determining a fault influence range according to the first analysis result and the second analysis result.
2. The service failure monitoring method of claim 1, wherein performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts comprises:
acquiring the weights between the interfaces on the host and the interfaces on the peripheral hosts;
sorting the weights in descending order;
and performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts in descending order of weight.
3. The service failure monitoring method of claim 2, wherein performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts in descending order of weight comprises:
acquiring the service function and service range of the failed service;
acquiring the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts;
and performing traversal analysis, in descending order of weight, according to the service function and service range of the failed service and the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts.
4. The service failure monitoring method of claim 3, wherein performing traversal analysis, in descending order of weight, according to the service function and service range of the failed service and the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts comprises:
comparing the service function and service range of the failed service with the service functions and service ranges of the interfaces on the failed host and the interfaces on the peripheral hosts to obtain a comparison result;
acquiring the time series of the service failure according to the comparison result;
acquiring the time-series weights of the failures of the interfaces on the host and the interfaces on the peripheral hosts;
and performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the time series of the service failure and the time-series weights of the failures of the interfaces on the host and the interfaces on the peripheral hosts.
5. The service failure monitoring method of claim 4, wherein acquiring the time series of the service failure according to the comparison result comprises:
when the service function and service range of the failed service are the same as those of the interfaces on the failed host and the interfaces on the peripheral hosts, acquiring the time series of the service failure.
6. The service failure monitoring method of claim 4, wherein performing traversal analysis on the interfaces on the host and the interfaces on the peripheral hosts according to the time series of the service failure and the time series of the failures of the interfaces on the host and the interfaces on the peripheral hosts comprises:
comparing the time series of the service failure with the time series of the failures of the interfaces on the host and the interfaces on the peripheral hosts to obtain a comparison result;
and obtaining a target failure interface according to the comparison result.
7. The service failure monitoring method of claim 6, wherein obtaining the target failure interface according to the comparison result comprises:
sorting the interfaces that failed before the service failure time and have a pre-causal relationship with the service failure by time-series weight in descending order, and taking the interface with the largest weight as the target failure interface.
8. The service failure monitoring method of any one of claims 1-7, wherein after determining the fault influence range according to the first analysis result and the second analysis result, the method further comprises: determining the fault influence type and the influence level according to the service function and service range of the host and the service functions and service ranges of the interfaces on the host and the interfaces on the peripheral hosts;
and generating a fault processing action according to the fault influence type and the influence level.
9. A service fault monitoring system, the system comprising: memory, processor and computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any of claims 1-8.
10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement the service failure monitoring method of any one of claims 1-8.
CN202110241842.6A 2021-03-04 2021-03-04 Service fault monitoring method, system and computer readable storage medium Active CN113037550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241842.6A CN113037550B (en) 2021-03-04 2021-03-04 Service fault monitoring method, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113037550A CN113037550A (en) 2021-06-25
CN113037550B true CN113037550B (en) 2022-07-26

Family

ID=76467654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241842.6A Active CN113037550B (en) 2021-03-04 2021-03-04 Service fault monitoring method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113037550B (en)

Also Published As

Publication number Publication date
CN113037550A (en) 2021-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant