CN113596078A - Service problem positioning method and device

Info

Publication number: CN113596078A
Application number: CN202110671110.0A
Authority: CN
Inventors: 单戈
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2021-11-02

Abstract

The application discloses a service problem positioning method and device, which can timely and accurately position problems existing in a service processing process from the perspective of global operation. The method comprises the following steps: acquiring link tracking data, wherein the link tracking data comprises execution fragment information of a service request of a target service when each service node is called; based on the parent-child relationship of each piece of execution fragment information in the link tracing data, establishing a node calling relationship graph corresponding to the service request, wherein the node calling relationship graph carries the calling relationship of the service request among service nodes and the execution time consumption of the service request when the service node calls; determining a problem service node corresponding to the target service based on the difference between the node calling relationship graph corresponding to the service request and a preset expected calling relationship graph; and determining the problem root of the problem service node based on the attribute label contained in the execution fragment information of the service request when the problem service node is called.

Description

Service problem positioning method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for locating service problems.

Background

A distributed system is a system of computer nodes that communicate over a network and work in concert to accomplish a common task. Most existing business processing adopts a distributed system architecture, in which different teams are responsible for different service nodes of the same service. Finding and locating service problems in time is therefore an important part of ensuring service availability.

At present, traditional approaches to locating service problems mainly include active probing of interfaces, monitoring of each service node, and monitoring of abnormal traffic. Each operation and maintenance team monitors only the data related to the services it is responsible for and troubleshoots by analyzing that data. Because the monitoring data acquired by each team is isolated, no team can see the overall condition of the service, and service problems cannot be accurately found and located.

Therefore, a solution capable of timely and accurately positioning the service problem from a global perspective is needed.

Disclosure of Invention

The embodiment of the application provides a method and a device for positioning service problems, which can timely and accurately position the problems existing in the service processing process from the perspective of global operation.

In order to achieve the above purpose, the following technical solutions are adopted in the embodiments of the present application:

In a first aspect, an embodiment of the present application provides a method for locating a service problem, including:

acquiring link tracking data, wherein the link tracking data comprises execution fragment information of a service request of a target service when each service node is called, and the execution fragment information comprises an identifier and an attribute tag of the service request, a timestamp called by the service node, an execution fragment identifier and a parent execution fragment identifier;

establishing a node calling relation graph corresponding to the service request based on the parent-child relation of each piece of execution fragment information in the link tracking data, wherein the node calling relation graph carries the calling relation of the service request among service nodes and the execution time consumption of the service request when the service node calls;

determining a problem service node corresponding to the target service based on the difference between a node calling relation graph corresponding to the service request and a preset expected calling relation graph;

and determining the problem root of the problem service node based on the attribute label contained in the execution fragment information of the service request when the problem service node is called.

In a second aspect, an embodiment of the present application provides a service problem positioning apparatus, including:

an acquisition unit, configured to acquire link tracking data, wherein the link tracking data comprises execution fragment information of a service request of a target service when each service node is called, and the execution fragment information comprises an identifier and an attribute tag of the service request, a timestamp of the service node call, an execution fragment identifier and a parent execution fragment identifier;

a building unit, configured to establish a node calling relation graph corresponding to the service request based on the parent-child relation of each piece of execution fragment information in the link tracking data, wherein the node calling relation graph carries the calling relation of the service request among the service nodes and the execution time consumption of the service request when the service node is called;

a problem node determining unit, configured to determine a problem service node corresponding to the target service based on the difference between the node calling relation graph corresponding to the service request and a preset expected calling relation graph;

and a root cause analysis unit, configured to determine the problem root cause of the problem service node based on the attribute label contained in the execution fragment information of the service request when the problem service node is called.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the method of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing one or more programs that, when executed by an electronic device that includes a plurality of application programs, cause the electronic device to perform the method of the first aspect.

According to the at least one technical scheme adopted by the embodiment of the application, because the execution fragment information in the link tracing data can reflect the end-to-end association relationship and has the global property, the node calling relationship corresponding to the service request is established based on the parent-child relationship of each execution fragment information in the link tracing data, the global property of the link tracing data can be fully utilized, and the node calling relationship graph capable of reflecting the execution condition of the service request under the distributed system architecture in the global direction is established; because the preset expected calling relation graph can reflect the calling relation of the service request among the service nodes under normal conditions and the execution time consumption of the service request when the service node is called globally, the problem service node corresponding to the target service can be accurately determined based on the difference between the node calling relation graph corresponding to the service node and the expected calling relation graph; the attribute label contained in the execution fragment information of the service request when the problem service node is called can reflect the execution condition of the service node when the service node is called, and further based on the attribute label, the problem root of the problem service node can be accurately positioned, so that the problem existing in the service processing process can be timely and accurately positioned from the perspective of global operation. Moreover, the whole analysis and positioning process does not need manual participation, the automation degree of service problem positioning is improved, and the problem positioning efficiency is improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1A is a schematic diagram of an implementation environment of a service problem location method according to an embodiment of the present application;

FIG. 1B is a schematic diagram of an implementation environment of a service problem location method according to another embodiment of the present application;

FIG. 2 is a schematic flowchart of a service problem location method according to an embodiment of the present application;

FIG. 3 is a node call relationship diagram provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of merging different node call relationship graphs according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a first execution time and a second execution time at different times according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a clustering result according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a service problem locating device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As described above, the conventional service problem location method mainly includes active detection of interfaces, monitoring of each service node, monitoring of abnormal traffic, and the like, each operation and maintenance team only monitors data related to the service in charge of the team, and issues are checked by analyzing the monitored data. Therefore, each operation and maintenance team is limited by the independence of the acquired monitoring data, the whole condition of the service cannot be known, and the service problem cannot be accurately found and positioned. For example, if a certain operation and maintenance team monitors that data related to own business is normal, the business is considered to be normal.

Therefore, the embodiment of the application aims to provide a scheme capable of timely and accurately positioning the service problem from the perspective of global operation.

It should be understood that the service problem location method provided by the embodiments of the present specification may be executed by an electronic device or software installed in the electronic device, and specifically may be executed by a server device.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

For convenience of understanding, the following briefly introduces an implementation environment to which the service problem location method provided in the embodiments of the present specification is applicable. Referring to fig. 1A and fig. 1B, an implementation environment related to an embodiment of the present disclosure includes a data transmission middleware, a storage module, a log processing module, an analysis module, and a presentation module.

The data transmission middleware comprises a log uploading channel log-Kafka and a link tracking data uploading channel Span-Kafka. The log-Kafka can be used by service providers of non-Java projects (such as the client, MAPI, MPS, and Nginx) to upload their logs through log access, and the Span-Kafka can be used by service providers of Java projects (such as a message platform, a live broadcast platform, a video platform, and an open platform) to upload, through SDK access, the execution fragment information (Span) generated when each service node they are responsible for is called.

The storage module may include a plurality of storage sub-modules, for example but not limited to Cassandra, ElasticSearch, and graph, and different storage sub-modules may be used to store different data. For example, Cassandra may store the original execution fragment information, ElasticSearch may store an index identification of the execution fragment information, and graph may store time-series data. Of course, as shown in FIG. 1B, the storage module may also include any other suitable type of storage sub-module, such as Redis.

The log processing module comprises a sub-module log processor for processing the log and a sub-module Span processor for processing the execution fragment information, wherein the log processor can read the log from the log-Kafka and perform format conversion to obtain the execution fragment information Span, and then the execution fragment information obtained by conversion is sent to the Span-Kafka. The Span processor can read the execution fragment information from the Span-Kafka and store the execution fragment information in the storage module, and specifically can store the execution fragment information in the submodule Cassandra and store the index identifier of the execution fragment information in the storage submodule ElasticSearch.

The analysis module can perform streaming data processing on the execution fragment information to analyze and locate problems existing in the service processing process, and store the intermediate results and the final analysis result in the storage module. In particular, the analysis module may comprise a sub-module eva-flink for processing time-series data and a sub-module eva-detect for analysis and problem localization. The eva-flink can read, from the Span-Kafka, the execution fragment information generated when each service node is called for current service requests of the same service, establish the node call relationship graph corresponding to each service request from the parent-child relationships of the execution fragment information, merge the established node call relationship graphs, and store the result in the storage module, specifically in the storage sub-module graph. Of course, the eva-flink may also provide monitoring for various types of services, such as IP services, experience services (e.g., microblog services), and grayscale services.

The eva-detect can periodically read a node call relation graph corresponding to a service request of the same service from the graph, determine an abnormal problem service node and further analyze a problem root of the problem service node by analyzing the read node call relation graph, and then store an analysis result into an ElasticSearch.

The display module may include a front-end page eva-web and an interface eva-api, and corresponding data may be queried from the storage module by calling the interface eva-api, and displayed through the front-end page eva-web, so that relevant personnel including a service provider, a user, an operation and maintenance party, etc. can timely know service execution conditions and existing problems.

Based on the implementation environment, the embodiment of the application provides a service problem positioning method. Referring to fig. 2, a schematic flow chart of a service problem location method according to an embodiment of the present application is shown, where the method includes:

S202, acquiring link tracking data, where the link tracking data comprises execution fragment information of the service request of the target service when each service node is called.

In the embodiment of the application, the service request is at the service node.

The service request for the target service is used for requesting to acquire the target service. Each time a user requests to obtain a target service, a service request is generated. For example, the target service is sending a private letter, and a private letter sending request is generated every time the user requests to send the private letter.

In the embodiment of the present application, the number of the obtained service requests for the target service may be one, or may also be multiple. In order to timely find out the problem existing in the target service processing process, in specific implementation, the link tracking data at the current moment can be obtained at regular time according to a preset time interval.

Under the distributed system architecture, each service request can be divided into a plurality of subtasks for processing, and each subtask is executed by calling a corresponding service node. Each time a service node is called for a single service request, a piece of execution fragment information related to the service request is generated and reported. The execution fragment information may include, but is not limited to, an identifier of the service request (TraceId), an attribute tag of the service request, a timestamp of the service node call, an execution fragment identifier (SpanId), a parent execution fragment identifier (parent id), and the like.

Wherein the identification of the service request (TraceId) is used to identify the service request to which all execution fragments of a call procedure belong. The TraceId is carried in the service request and in turn passed on to the downstream serving node.

The attribute label of the service request is used for representing the attribute of the service request, and when different service nodes are called, the obtained execution fragment information contains different attribute labels. For example, when the service node, which is the client, is called, the obtained attribute tags included in the execution segment may include attribute values corresponding to the service request in multiple attribute dimensions, such as a network type, a terminal system, an application version, a terminal type, a region to which the service request belongs, an operator to which the service request belongs, and a return code.

The timestamp of the service node invocation is used to indicate the start-stop time of the service node invocation. Thus, based on the timestamp of the service node call, the execution time of the service request at the time of the service node call can be determined.

The execution fragment identifier generated when a service node is called is used to uniquely identify that service node. The parent execution fragment identifier is used to uniquely identify the parent service node of that service node, that is, the service node that precedes it in the calling order.

Specifically, each service node, when invoked, generates an identifier for identifying the service node, i.e., an execution fragment identifier, for the service request, and the execution fragment identifier is transmitted to a downstream service node along with the service request. For each service node, the execution fragment identification of the upstream service node, which is passed along with the service request from the upstream service node, is recorded as the parent execution fragment identification of the current service node. Thus, the execution fragment identifier and the parent execution fragment identifier included in the execution fragment information at the time of calling the service node can reflect the association relationship between the service nodes.
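The identifier propagation described above can be sketched as follows. This is a minimal illustration only: the function names `start_span` and `outgoing_context`, the dictionary keys, and the use of random UUIDs as identifiers are assumptions made for the example and are not specified by the application.

```python
import uuid


def start_span(incoming=None):
    """Create the identifiers recorded when the current service node is called.

    `incoming` is the (hypothetical) context passed along with the request
    by the upstream node; it is None at the entry node.
    """
    incoming = incoming or {}
    return {
        # Reuse the upstream TraceId, or create one at the entry node.
        "trace_id": incoming.get("trace_id") or uuid.uuid4().hex,
        # A fresh execution fragment identifier (SpanId) for this call.
        "span_id": uuid.uuid4().hex[:16],
        # The upstream SpanId is recorded as the parent identifier.
        "parent_id": incoming.get("span_id"),
    }


def outgoing_context(span):
    """Values a node attaches to the request before calling downstream."""
    return {"trace_id": span["trace_id"], "span_id": span["span_id"]}


# Three nodes called in sequence: client -> load balancer -> message service.
client = start_span()
balancer = start_span(outgoing_context(client))
message = start_span(outgoing_context(balancer))
```

In this sketch all three fragments share one TraceId, and each fragment's parent identifier equals the SpanId of the node that called it, which is the association the later graph-building step relies on.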

Of course, it should be understood that the execution fragment information when the service node calls may also include other information related to the service request, for example, a name of the execution method, an identity of the initiator, and the like, which may be increased or decreased specifically according to actual needs, and this is not specifically limited in this embodiment of the application.
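For illustration, one piece of execution fragment information with the fields listed above might be modeled as below. The class name, field names, the `node` field, and the millisecond unit are assumptions for this sketch; the application does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """One piece of execution fragment information (illustrative layout)."""
    trace_id: str              # identifier of the service request (TraceId)
    span_id: str               # execution fragment identifier (SpanId)
    parent_id: Optional[str]   # parent execution fragment identifier; None at the root
    node: str                  # name of the called service node (assumed field)
    start_ms: int              # start of the call, in milliseconds (assumed unit)
    end_ms: int                # end of the call, in milliseconds (assumed unit)
    tags: Dict[str, str] = field(default_factory=dict)  # attribute tags

    @property
    def duration_ms(self) -> int:
        """Execution time consumption derived from the start and stop timestamps."""
        return self.end_ms - self.start_ms


span = Span("t1", "s2", "s1", "message-service", 1_000, 1_045,
            {"network_type": "wifi", "return_code": "200"})
print(span.duration_ms)  # 45
```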

In order to enable different service providers to conveniently obtain the link tracking data, the link tracking data can be obtained by one or more of the following ways.

Mode 1: collecting the running logs of each service node by executing a pre-configured log program, and extracting, from the collected running logs, the execution fragment information of the service request for the target service when each service node is called.

For example, for service providers (such as the client, the MAPI, the MPS, and the Nginx shown in fig. 1A) using multiple programming languages, such as a client, a web page, a PHP front end, etc., the above log programs may be deployed in respective servers, and link tracking data is obtained in this manner.

Mode 2: introducing a pre-configured Software Development Kit (SDK) into a pre-created monitoring program through a predetermined package management tool and executing the monitoring program to acquire execution fragment information of a service request when each service node is called.

The package management tool may include one or more of Maven, Classpath, and the like.

For example, for a service provider using the Java language (such as the message platform, live broadcast platform, video platform, and open platform shown in fig. 1A), the link tracking data can be obtained using mode 2.

It can be understood that in this way, the service provider can access the distributed system without modifying the code, thereby obtaining the link tracking data.

S204, establishing a node calling relation graph corresponding to the service request based on the parent-child relation of each piece of execution fragment information in the link tracking data.

The node calling relationship graph corresponding to the service request carries the calling relationship of the service request among the service nodes and the execution time of the service request when the service node calls.

As an optional implementation manner, the establishing of the node call relationship graph corresponding to the service request may include the following steps:

Step A1, determining the parent-child relationship of each piece of execution segment information belonging to the service request based on the service request identifier, the execution segment identifier and the parent execution segment identifier contained in each piece of execution segment information in the link trace data.

Specifically, the execution segment information including the identifier of the same service request may be determined as the execution segment information belonging to the same service request, and the parent-child relationship of each execution segment information may be determined based on the execution segment identifier and the parent execution segment identifier included in the execution segment information.

Step A2, based on the parent-child relationship, determining the calling relationship of the service request between the service nodes.

The parent-child relationship between different pieces of execution segment information can reflect the upstream and downstream relationship of the service nodes generating different pieces of execution segment information, and the calling relationship of the service request between the service nodes can be determined based on the upstream and downstream relationship. For example, if the execution fragment information a is parent execution fragment information of the execution fragment information B, it may be determined that the service node generating the execution fragment information a is invoked before the service node generating the execution fragment information B.

Step A3, determining the execution time of the service request when the service node calls based on the timestamp of the service request calling at the service node.

Specifically, the timestamp of the service request at the time of invocation at the service node is used to indicate the start-stop time of the service request at the time of invocation at the service node, based on which the execution time of the service request at the time of invocation at the service node can be determined.

Step A4, based on the calling relationship between service nodes of the service request and the execution time consumption when the service nodes are called, a node calling relationship graph corresponding to the service request is established.

Specifically, a node call relationship tree characterizing the call relationships between service nodes may be generated based on the call relationships of the service request between service nodes, where the nodes in the tree represent the service nodes that must be called to process the service request, and the connections between nodes represent the call relationships between service nodes. Further, taking the execution time consumption as the abscissa, the corresponding node call relationship graph can be created based on the generated node call relationship tree and the execution time consumption of the service request when each service node is called. For example, fig. 3 shows an example of a node call relationship graph, in which the vertical axis shows the call relationships among a plurality of service nodes, including client message sending, load balancing scheduling, message service, authentication service, user service, cache and database service, and message queue service, and the horizontal axis represents the execution time consumption.
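Steps A1 to A4 can be sketched in a few lines. The `Span` tuple, its field names, and the output layout (`edges` plus `cost_ms`) are illustrative assumptions; where a service node is called more than once within one request, this sketch simply sums its durations, which is also an assumption rather than something the application specifies.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Span(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    node: str
    start_ms: int
    end_ms: int


def build_call_graphs(spans):
    """Return one call graph per service request, keyed by TraceId."""
    by_trace = defaultdict(list)
    for s in spans:                      # Step A1: same TraceId, same request
        by_trace[s.trace_id].append(s)

    graphs = {}
    for trace_id, group in by_trace.items():
        node_of = {s.span_id: s.node for s in group}
        # Step A2: a parent/child pair of spans gives a caller -> callee edge.
        edges = [(node_of[s.parent_id], s.node)
                 for s in group if s.parent_id in node_of]
        # Step A3: execution time consumption from start/stop timestamps.
        cost = defaultdict(int)
        for s in group:
            cost[s.node] += s.end_ms - s.start_ms
        # Step A4: the graph carries both call relations and time consumption.
        graphs[trace_id] = {"edges": edges, "cost_ms": dict(cost)}
    return graphs


spans = [
    Span("t1", "a", None, "client", 0, 100),
    Span("t1", "b", "a", "load-balancer", 5, 95),
    Span("t1", "c", "b", "message-service", 10, 90),
]
graphs = build_call_graphs(spans)
```

Here `graphs["t1"]["edges"]` lists the caller/callee pairs in the order the spans were supplied, and `graphs["t1"]["cost_ms"]` maps each service node to its execution time consumption.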

It should be noted that, the above-mentioned process is only a generation process of a node call relation graph corresponding to one service request, and if the number of service requests for the target service is multiple, the node call relation graph corresponding to the service request may be established in the above-mentioned manner for each service request.

Additionally, the created node call relation graph can also carry the start and stop time of each service node call. Therefore, the created node call relation graph can reflect the call timing sequence among the service nodes more intuitively.

Furthermore, besides the form shown in fig. 3, the node call relationship graph in the embodiment of the present application may take any other suitable form, such as a directed graph; the form of the node call relationship graph is not specifically limited in the embodiment of the present application.

The node call relation graph corresponding to the service request established by the embodiment can accurately and intuitively reflect the call relation of the service request between the service nodes and the execution time consumption of the service request during calling of the service nodes, so that the execution condition of the service request under the distributed system architecture can be globally and intuitively reflected, and powerful data support can be provided for subsequent discovery and positioning problems.

Of course, other methods commonly used in the art may also be used to establish the node call relationship diagram corresponding to the service request, and the embodiments of the present application are not expanded in detail here.

S206, based on the difference between the node calling relation graph corresponding to the service request and the preset expected calling relation graph, determining the problem service node corresponding to the target service.

The expected call relation graph can be generated based on the call relations between service nodes of normally processed service requests for the target service and the execution time consumption when the service nodes are called. Alternatively, considering that the execution conditions of service requests differ across periods, the expected call relation graph can also be generated based on historical link tracking data in a predetermined time period before the current time. The specific manner of generating the expected call relation graph is similar to the manner of generating the node call relation graph corresponding to the service request, and is not described herein again.

The expected call relation graph is used to represent the call relations of the service request among the service nodes under normal conditions and the execution time consumption when each service node is called. On this basis, the node call relation graph corresponding to the service request is compared with the expected call relation graph, and the abnormal problem service node can be determined by analyzing the difference between the two. For example, if a certain service node exists only in the node call relation graph corresponding to the service request or only in the expected call relation graph, or the execution time consumption corresponding to the service node in the node call relation graph differs greatly from that in the expected call relation graph, it may be determined that the service node is abnormal.

As an embodiment, if there is one service request for the target service, determining the problem service node corresponding to the target service includes: for a single service node in the expected call relation graph, acquiring the difference between the execution time consumption corresponding to the service node in the expected call relation graph and the execution time consumption corresponding to the service node in the node call relation graph; and if the difference exceeds a preset threshold and the execution time consumption corresponding to the service node in the node call relation graph exceeds that in the expected call relation graph, determining that the service node is a problem service node.
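The structural part of this comparison, a service node appearing in only one of the two graphs, amounts to a set difference. The sketch below assumes each graph has already been reduced to the collection of service node names it contains; the function name and the result keys are hypothetical.

```python
def structural_diff(observed_nodes, expected_nodes):
    """Report service nodes that appear in only one of the two graphs."""
    observed, expected = set(observed_nodes), set(expected_nodes)
    return {
        # Called for this request but absent from the expected graph.
        "unexpected": sorted(observed - expected),
        # Present in the expected graph but never called for this request.
        "missing": sorted(expected - observed),
    }


diff = structural_diff(
    observed_nodes=["client", "load-balancer", "cache"],
    expected_nodes=["client", "load-balancer", "database"],
)
print(diff)  # {'unexpected': ['cache'], 'missing': ['database']}
```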

If the number of the service requests for the target service is multiple, considering that there may be a difference in execution of different service requests for the same service, in order to determine the problem service node more accurately, determining the problem service node corresponding to the target service may include the following steps:

Step B1, merging the node call relation graphs corresponding to different service requests of the target service.

Specifically, merging different node call relationship graphs refers to merging the service nodes in the different node call relationship graphs together with their corresponding execution time consumption. In an embodiment, an average value of the execution time consumption corresponding to the same service node in the node call relationship graphs corresponding to different service requests for the target service may be determined, and the average value is used as the execution time consumption corresponding to that service node in the merged node call relationship graph.

In another embodiment, an average value of start times of the same service node in different node call relationship graphs may be used as a corresponding start time of the service node in the merged node call relationship graph, and an average value of end times of the service node in different node call relationship graphs may be used as a corresponding end time of the service node in the merged node call relationship graph, so that it may be determined that the service node consumes time for corresponding execution in the merged node call relationship graph.

For example, if service node A does not exist in node call relationship graph 1 (that is, its execution time consumption is 0), its execution time consumption in node call relationship graph 2 is 50 ms, and its execution time consumption in node call relationship graph 3 is 40 ms, it may be determined that the execution time consumption corresponding to the service node in the merged node call relationship graph is (0 + 50 + 40) / 3 = 30 ms. For another example, as shown in fig. 4, the merged node call relationship graph on the right side of fig. 4 may be obtained by merging the different node call relationship graphs on the left side of fig. 4 in the above manner.

Of course, it can be understood that other technical means commonly used in the art may also be adopted to merge the node call relationship graphs corresponding to different service requests.
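The averaging-based merge can be illustrated with a short sketch. It assumes each node call relationship graph is represented as a mapping from service node name to execution time consumption in milliseconds, and that a node missing from a graph contributes 0 ms, as in the worked example above. The function name `merge_durations` and the node names are hypothetical.

```python
def merge_durations(graphs):
    """Average each node's execution time across graphs; a missing node counts as 0 ms."""
    nodes = set()
    for graph in graphs:
        nodes.update(graph)
    return {
        node: sum(graph.get(node, 0.0) for graph in graphs) / len(graphs)
        for node in sorted(nodes)
    }


# Three hypothetical per-request graphs; node "A" is absent from the first one.
per_request_graphs = [
    {"B": 20.0},
    {"A": 50.0, "B": 30.0},
    {"A": 40.0, "B": 10.0},
]

merged = merge_durations(per_request_graphs)
print(merged)  # {'A': 30.0, 'B': 20.0}
```

Node A averages to (0 + 50 + 40) / 3 = 30 ms, which matches the example in the text.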

Step B2, for a single service node in the expected call relationship graph, acquiring the difference between the first execution time consumption and the second execution time consumption of the single service node.

The first execution time consumption is the execution time consumption corresponding to the single service node in the expected call relationship graph, and the second execution time consumption is the execution time consumption corresponding to the single service node in the merged node call relationship graph.

Step B3, if the first execution time consumption is less than the second execution time consumption and the difference exceeds a preset threshold, determining that the single service node is a problem service node.

For example, as shown in fig. 5, at 18:39 the difference between the first execution time consumption (shown by a solid line in the figure) and the second execution time consumption (shown by a dashed line in the figure) of the client service node exceeds the preset threshold, and the first execution time consumption is less than the second execution time consumption, so the client may be determined to be the problem service node.

S208, determining the problem root of the problem service node based on the attribute label contained in the execution fragment information of the service request when the problem service node is called.

The attribute label contained in the execution fragment information of the service request when the problem service node is called can reflect the execution condition of the service node when the service node is called, so the problem root of the problem service node can be determined by analyzing the attribute label.

When the service requests for the target service are highly concurrent, the service requests whose processing is abnormal share common characteristics at the problem service node. Therefore, as an optional implementation, the service requests can be clustered and the problem root of the problem service node can be analyzed based on the clustering result.

Specifically, determining the problem root cause of the problem service node may include the steps of:

Step C1, clustering different service requests for the target service based on the attribute labels contained in the execution fragment information of the different service requests when the problem service node is called, so as to obtain a plurality of service request clusters.

In a specific implementation, the clustering of the different service requests may be implemented by any appropriate clustering algorithm, such as the K-means algorithm or a graph community detection algorithm, which is not specifically limited in the embodiments of the present application.

Step C2, determining the problem root of the problem service node based on the attribute labels, when the problem service node is called, of the service requests contained in each of the service request clusters.

The service request cluster containing the largest number of service requests usually has the greatest commonality and therefore reflects more accurately how the problem service node generally processes service requests. The problem root of the problem service node can thus be analyzed based on the attribute labels, when the problem service node is called, of the service requests in the service request cluster containing the largest number of service requests. For example, taking the client as the problem service node, if the attribute labels of most of the service requests in that cluster indicate that the error code returned by the problem service node is 7004_network unavailable, it may be determined that the problem root of the problem service node is that the network of most users' terminals is unavailable.

In order to further improve the accuracy and reliability of the root cause analysis result, the attribute label may include attribute values corresponding to a plurality of attribute dimensions. Correspondingly, a target cluster is first determined, the target cluster being the service request cluster that contains the largest number of service requests among the service request clusters. Then, for a single attribute dimension, the service requests in the target cluster are grouped based on their attribute values in that attribute dimension, so as to obtain the proportion of service requests corresponding to each attribute value of the target cluster in that attribute dimension. Next, the attribute value with the highest proportion of service requests is determined as the abnormal attribute value corresponding to that attribute dimension. Finally, the problem root of the problem service node is determined based on the abnormal attribute values corresponding to the respective attribute dimensions.

For example, as shown in fig. 6, if the problem service node is a client, the attribute dimensions corresponding to the problem service node include the region to which the initiator of the service request belongs, the operator, the network used, the system, the return code, the version, the execution time consumption, and the like. There are 1000 service requests for the target service, and clustering them yields 6 service request clusters. The service requests in the cluster containing the largest number of service requests are grouped in the above manner to obtain, for each attribute dimension, the attribute value with the highest proportion of service requests in the cluster, so that the problem root of the problem service node can be determined as follows: the network is unavailable for initiators that are located in the Guangdong region, use application version 971, and are on the Unicom 4G network or the Mobile Wi-Fi network.
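The per-dimension grouping inside the target cluster can be sketched as below. The sketch assumes the clustering has already been performed by some algorithm and that each service request is represented by a dictionary of attribute values with the same keys. The function name `dominant_attributes`, the dimension names, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def dominant_attributes(clusters):
    """For the largest cluster, return the most frequent value and its share per dimension."""
    target = max(clusters, key=len)  # target cluster: the one with the most requests
    result = {}
    for dimension in target[0]:
        counts = Counter(request[dimension] for request in target)
        value, count = counts.most_common(1)[0]
        result[dimension] = (value, count / len(target))
    return result


# Hypothetical clustering output: a list of clusters, each a list of attribute labels.
clusters = [
    [
        {"region": "Guangdong", "operator": "Unicom", "return_code": "7004"},
        {"region": "Guangdong", "operator": "Unicom", "return_code": "7004"},
        {"region": "Guangdong", "operator": "Mobile", "return_code": "7004"},
        {"region": "Beijing", "operator": "Unicom", "return_code": "7004"},
    ],
    [
        {"region": "Shanghai", "operator": "Telecom", "return_code": "200"},
    ],
]

abnormal = dominant_attributes(clusters)
print(abnormal["region"])       # ('Guangdong', 0.75)
print(abnormal["return_code"])  # ('7004', 1.0)
```

Combining the dominant value of every dimension gives a candidate description of the problem root, which a person or a further rule could then interpret.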

Further, so that relevant personnel can learn the execution condition of the target service in time, prompt information can be output based on the problem root after the problem root of the problem service node is determined. For example, prompt information indicating the problem root of the problem service node may be presented on a front-end page. Of course, intermediate data of the analysis process can also be presented, such as the created node call relationship graph and the execution fragment information, when the problem service node is called, of the service requests in each service request cluster.

According to the service problem positioning method provided by the embodiments of the present application, the execution fragment information in the link tracing data reflects end-to-end association relationships and is therefore global in nature. Establishing the node call relationship graph corresponding to the service request based on the parent-child relationship of each piece of execution fragment information in the link tracing data makes full use of this global nature and yields a node call relationship graph that reflects, from a global perspective, how the service request is executed under the distributed system architecture. Because the predetermined expected call relationship graph reflects, from a global perspective, the call relationship of the service request among the service nodes under normal conditions and the execution time consumption of the service request when each service node is called, the problem service node corresponding to the target service can be accurately determined based on the difference between the node call relationship graph corresponding to the service request and the expected call relationship graph. The attribute label contained in the execution fragment information of the service request when the problem service node is called reflects the execution condition at that service node, so the problem root of the problem service node can be accurately located based on the attribute label. Problems in the service processing process can thus be located in a timely and accurate manner from the perspective of global operation. Moreover, the whole analysis and positioning process requires no manual participation, which raises the degree of automation of service problem positioning and improves positioning efficiency.

It should be noted that the steps of the methods provided in the above embodiments may all be executed by the same device, or may be executed by different devices. For example, the execution subject of step 202 and step 204 may be device 1, and the execution subject of step 206 and step 208 may be device 2; for another example, the execution subject of step 202 may be device 1, and the execution subject of steps 204 to 208 may be device 2; and so on.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 7, but this does not indicate only one bus or one type of bus.

The memory is used for storing programs. Specifically, a program may include program code, and the program code includes computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the service problem positioning device on the logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:

The method executed by the service problem locating device according to the embodiment shown in fig. 2 of the present application may be applied to or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.

The electronic device may also execute the method in fig. 2 and implement the function of the service problem positioning apparatus in the embodiment shown in fig. 2, which is not described herein again in this embodiment of the present application.

Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.

Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which when executed by a portable electronic device including a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 2, and are specifically configured to:

Fig. 8 is a schematic structural diagram of a service problem locating device according to an embodiment of the present application. Referring to fig. 8, in a software implementation, the service problem locating device 800 may include:

an obtaining unit 810, configured to obtain link trace data, where the link trace data includes execution fragment information of a service request of a target service when each service node is called, and the execution fragment information includes an identifier and an attribute tag of the service request, a timestamp called by the service node, an execution fragment identifier, and a parent execution fragment identifier;

a building unit 820, configured to establish a node call relationship graph corresponding to the service request based on the parent-child relationship of each piece of execution fragment information in the link tracing data, where the node call relationship graph carries the call relationship of the service request among service nodes and the execution time consumption of the service request when the service node is called;

a problem node determining unit 830, configured to determine a problem service node corresponding to the target service based on a difference between a node call relationship graph corresponding to the service request and a predetermined expected call relationship graph;

a root cause analysis unit 840, configured to determine the problem root of the problem service node based on the attribute label contained in the execution fragment information of the service request when the problem service node is called.

Optionally, the constructing unit 820 is specifically configured to:

determining a parent-child relationship of each piece of execution segment information belonging to the service request based on the service request identifier, the execution segment identifier and the parent execution segment identifier contained in each piece of execution segment information in the link tracing data;

determining the calling relation of the service request among service nodes based on the parent-child relation;

determining the execution time consumption of the service request when the service request is called at a service node based on the timestamp called by the service request at the service node;

and establishing a node calling relationship graph corresponding to the service request based on the calling relationship of the service request among the service nodes and the execution time consumption of the service request during calling of the service nodes.
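The construction performed by the building unit can be sketched as follows. This is a minimal illustration under stated assumptions: each piece of execution fragment information is modeled as a `Span` record with a start and an end timestamp in milliseconds, and the graph is represented as a list of caller-to-callee edges plus a per-node duration mapping. The class, field, and function names are hypothetical and are not defined by the application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    request_id: str                # identifier of the service request
    span_id: str                   # execution fragment identifier
    parent_span_id: Optional[str]  # parent execution fragment identifier, None for the root
    node: str                      # service node that was called
    start_ms: int                  # assumed timestamp when the call started
    end_ms: int                    # assumed timestamp when the call ended


def build_call_graph(spans, request_id):
    """Return (edges, durations) for one service request."""
    own = [s for s in spans if s.request_id == request_id]
    by_id = {s.span_id: s for s in own}
    edges = []
    durations = {}
    for span in own:
        # Execution time consumption derived from the call timestamps.
        durations[span.node] = durations.get(span.node, 0) + (span.end_ms - span.start_ms)
        # The parent-child relationship of fragments gives the caller -> callee edge.
        parent = by_id.get(span.parent_span_id)
        if parent is not None:
            edges.append((parent.node, span.node))
    return edges, durations


spans = [
    Span("req-1", "s1", None, "client", 0, 100),
    Span("req-1", "s2", "s1", "gateway", 10, 90),
    Span("req-1", "s3", "s2", "feed", 20, 60),
    Span("req-2", "s9", None, "client", 0, 40),
]

edges, durations = build_call_graph(spans, "req-1")
print(edges)      # [('client', 'gateway'), ('gateway', 'feed')]
print(durations)  # {'client': 100, 'gateway': 80, 'feed': 40}
```

Filtering by the service request identifier first keeps fragments of other requests, such as `req-2` here, out of the graph.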

Optionally, the problem node determining unit 830 is specifically configured to:

merging the node call relational graphs corresponding to different service requests aiming at the target service;

acquiring a difference value between first execution time consumption and second execution time consumption of a single service node aiming at the single service node in the expected calling relationship graph, wherein the first execution time consumption is the corresponding execution time consumption of the single service node in the expected calling relationship graph, and the second execution time consumption is the corresponding execution time consumption of the single service node in the combined node calling relationship graph;

and if the first execution time consumption is less than the second execution time consumption and the difference value exceeds a preset threshold value, determining that the single service node is a problem service node.

Optionally, the problem node determining unit 830 is specifically configured to determine the average value of the execution time consumptions corresponding to the same service node in the node call relationship graphs corresponding to different service requests for the target service, and to use the average value as the execution time consumption corresponding to that service node in the merged node call relationship graph.

Optionally, the root cause analysis unit 840 is specifically configured to:

clustering different service requests of the target service based on attribute labels contained in execution fragment information of the different service requests when the problem service node is called to obtain a plurality of service request clustering clusters;

and determining the problem root of the problem service node based on the attribute labels of the service requests respectively contained in the service request cluster when the problem service node is called.

Optionally, the attribute tag includes attribute values corresponding to a plurality of attribute dimensions;

the root cause analysis unit 840 is specifically configured to:

determining a target cluster, wherein the target cluster is a service request cluster containing the largest number of service requests in the service request clusters;

aiming at a single attribute dimension, grouping the service requests in the target cluster based on the attribute values of the service requests in the target cluster in the single attribute dimension respectively to obtain the service request occupation ratios corresponding to different attribute values of the target cluster in the single attribute dimension respectively;

determining the attribute value with the highest ratio of the corresponding service request as an abnormal attribute value corresponding to the single attribute dimension;

and determining the problem root of the problem node based on the abnormal attribute values respectively corresponding to the attribute dimensions.

Optionally, the obtaining unit 810 is specifically configured to:

collecting running logs of each service node by executing a pre-configured log program, and extracting execution fragment information of the service request when each service node is called from the collected running logs; and/or,

introducing a pre-configured Software Development Kit (SDK) into a pre-created monitoring program through a preset package management tool, and executing the monitoring program to acquire execution fragment information of the service request when each service node is called.
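The log-based acquisition path can be illustrated with a small sketch. The application does not specify a log format, so the example assumes, purely for illustration, that each service node writes one JSON object per line for every execution fragment and that a record containing a `span_id` field is an execution fragment. The field names and the function name `extract_spans` are hypothetical.

```python
import json


def extract_spans(log_lines):
    """Keep only the log lines that parse as JSON objects carrying a span_id field."""
    spans = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ordinary, non-JSON log output is skipped
        if isinstance(record, dict) and "span_id" in record:
            spans.append(record)
    return spans


log_lines = [
    "INFO server started",
    '{"request_id": "req-1", "span_id": "s1", "parent_span_id": null, '
    '"node": "client", "ts": 1000, "tags": {"return_code": "7004"}}',
    '{"level": "debug", "msg": "heartbeat"}',
]

spans = extract_spans(log_lines)
print(len(spans))           # 1
print(spans[0]["span_id"])  # s1
```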

According to the service problem positioning device provided by the embodiments of the present application, the execution fragment information in the link tracing data reflects end-to-end association relationships and is therefore global in nature. Establishing the node call relationship graph corresponding to the service request based on the parent-child relationship of each piece of execution fragment information in the link tracing data makes full use of this global nature and yields a node call relationship graph that reflects, from a global perspective, how the service request is executed under the distributed system architecture. Because the predetermined expected call relationship graph reflects, from a global perspective, the call relationship of the service request among the service nodes under normal conditions and the execution time consumption of the service request when each service node is called, the problem service node corresponding to the target service can be accurately determined based on the difference between the node call relationship graph corresponding to the service request and the expected call relationship graph. The attribute label contained in the execution fragment information of the service request when the problem service node is called reflects the execution condition at that service node, so the problem root of the problem service node can be accurately located based on the attribute label. Problems in the service processing process can thus be located in a timely and accurate manner from the perspective of global operation. Moreover, the whole analysis and positioning process requires no manual participation, which raises the degree of automation of service problem positioning and improves positioning efficiency.

In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims

1. A method for locating a business problem, comprising:

2. The method according to claim 1, wherein the establishing a node call relationship graph corresponding to a service request based on a parent-child relationship of each piece of execution fragment information in the link trace data includes:

3. The method of claim 1, wherein the determining the problem service node corresponding to the target service based on the difference between the node call relation graph corresponding to the service request and a predetermined expected call relation graph comprises:

4. The method according to claim 3, wherein the merging the node call relationship graphs corresponding to different service requests for the target service comprises:

5. The method of claim 1, wherein the determining the problem root cause of the problem service node based on an attribute tag included in execution fragment information of the service request when the problem service node is invoked comprises:

6. The method of claim 5, wherein the attribute tag comprises attribute values corresponding to a plurality of attribute dimensions;

the determining a problem root cause of the problem node based on the attribute labels of the service requests respectively contained in the plurality of service request cluster when the problem service node is called includes:

7. The method of any one of claims 1 to 6, wherein said obtaining link trace data comprises:

8. A business problem locating apparatus, comprising:

9. An electronic device, comprising:

a processor; and

10. A computer-readable storage medium storing one or more programs that, when executed by an electronic device including a plurality of application programs, cause the electronic device to: