CN114363144B

CN114363144B - Fault information association reporting method and related equipment for distributed system

Info

Publication number: CN114363144B
Application number: CN202011040443.5A
Authority: CN
Inventors: 余亮; 张亮; 鲁志军; 李煜
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2023-06-27
Anticipated expiration: 2040-09-28
Also published as: WO2022063032A1; CN114363144A

Abstract

The application provides a fault information association reporting method and related devices for a distributed system. The method comprises the following steps: a first device caches a first call relationship; when the first device fails while processing a first distributed service, the first device reports first fault information to a server; the first device looks up a second device in the first call relationship, the second device comprising a device that calls the first device, or a device that is called by the first device, when the first distributed service is executed; and the first device sends a first notification to the second device, or the first device sends the first notification to the second device on the condition that the second device has not reported second fault information. The method reduces data storage and analysis costs, allows more fault-related information to be reported, and improves fault locating efficiency.

Description

Fault information association reporting method and related equipment for distributed system

Technical Field

The application relates to the field of computer technology, and in particular to a fault information association reporting method and related devices for a distributed system.

Background

To meet growing service demands, distributed systems are increasingly used in large-scale application scenarios. A distributed system offers high reliability, good scalability and fast communication, and makes it easier for users to share resources. However, a distributed service may involve a very large number of devices, and the calls between devices and between the modules inside a device are complicated, so a fault is difficult to locate when it occurs.

In an existing fault locating method, calls between devices are first tracked by instrumenting the call path with a Trace ID; after a fault occurs, a global search is performed using the Trace ID of the abnormal service, and the results are then analyzed to locate the fault. In this process the fault is perceived at a single end: when a fault occurs, only the faulty device reports fault information, and the other devices that took part in processing the abnormal service may not report any fault-related information. The fault-related information obtained by the server is therefore very limited, which hinders subsequent fault locating and analysis. In addition, this method collects a large amount of data from normal service flows, most of which is likely to be irrelevant to the fault, which brings unnecessary data storage and analysis costs. Moreover, as the service volume grows, the amount of collected data grows with it and duplicate Trace IDs are likely to appear, which makes the analysis more difficult.

Therefore, how to report more fault-related information in a distributed system, so that faults can be located effectively, is a problem to be solved.

Disclosure of Invention

The application provides a fault information association reporting method and related devices for a distributed system. Fault information is reported only when a fault occurs, so data from normal flows does not need to be collected, which saves data storage and analysis costs. In addition, the faulty device not only reports its own fault information but also notifies the associated devices (peer devices) to report their information, so that the server has more useful information available when performing fault analysis.

In a first aspect, the present application provides a fault information association reporting method for a distributed system, where the method includes: a first device caches a first call relationship, where the first call relationship includes device call information of one or more distributed services that are initiated by a user and that the first device participates in processing; when the first device fails while processing a first distributed service, the first device reports first fault information to a server; the first device looks up a second device in the first call relationship, where the second device includes a device that calls the first device or is called by the first device when the first distributed service is executed; the first device sends a first notification to the second device, where the first notification instructs the second device to report second fault information to the server; or the first device sends the first notification to the second device on the condition that the second device has not reported the second fault information; the second fault information includes fault information of the second device when processing the first distributed service.

In the scheme provided by the application, call relationship caching is used, so information about unrelated devices and device call information of fault-free distributed services is not collected, which saves data storage and analysis costs. In addition, the faulty device not only reports its own fault information but also sends a message to the devices with which it has a call relationship, notifying them to report their own fault-related information to the server. With this reporting mode the server has more fault-related information available when locating the fault, which improves the efficiency and accuracy of fault locating.
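The flow of the first aspect can be illustrated with a minimal Python sketch. It is not the implementation of the application: the class `Device`, its methods, the service identifiers, and the in-memory lists that stand in for the server and for the device-to-device transport are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical in-memory model of the first device."""
    name: str
    # service identifier -> names of devices that called, or were called by, this device
    call_relations: dict = field(default_factory=dict)

    def cache_call(self, service_id, peer_name):
        """Cache one caller/callee relation for a distributed service."""
        self.call_relations.setdefault(service_id, set()).add(peer_name)

    def on_fault(self, service_id, detail, server_reports, send_notification):
        """Report this device's fault, then ask each cached peer to report too."""
        server_reports.append((self.name, service_id, detail))
        notified = []
        for peer in sorted(self.call_relations.get(service_id, ())):
            send_notification(peer, service_id)
            notified.append(peer)
        return notified


server_reports = []   # stands in for the server (assumption)
outbox = []           # stands in for the transport between devices (assumption)

phone = Device("phone")
phone.cache_call("svc-1", "tablet")   # phone called tablet
phone.cache_call("svc-1", "watch")    # watch called phone
phone.cache_call("svc-2", "tv")       # a different, fault-free service

notified = phone.on_fault("svc-1", "timeout", server_reports,
                          lambda peer, sid: outbox.append((peer, sid)))
```

Only the devices cached for the failing service are notified; nothing is collected or sent for the fault-free service `svc-2`.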

With reference to the first aspect, in a possible implementation manner of the first aspect, the looking up, by the first device, of the second device in the first call relationship includes: the first device looks up the second device in first device call information by using a first identifier; the first device call information is the device call information of the first distributed service in the first call relationship.

In the scheme provided by the application, the call information of each distributed service has an identifier, and each identifier is cached in the device together with the device call information of the corresponding distributed service. When the device fails, the cached device call information of the distributed service can be found by using the identifier in the fault information, and the peer device can then be found. The peer device is the device called by the faulty device, or the device that calls the faulty device, in the distributed service.
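As a hedged illustration of the identifier-based lookup, the sketch below keys hypothetical cached device call information by a service identifier and resolves the devices on the other end of the call from the identifier carried in the fault information. The record layout (`callers`, `callees`), the identifier values and the function name `find_peers` are assumptions, not details given in the application.

```python
# Hypothetical cached device call information, keyed by the service identifier.
call_cache = {
    "id-7f3a": {"callers": ["watch"], "callees": ["tablet", "tv"]},
    "id-91bc": {"callers": [], "callees": ["speaker"]},
}


def find_peers(fault_info, cache):
    """Return the devices that called, or were called by, the faulty device
    for the service named by the identifier in the fault information."""
    record = cache.get(fault_info["service_id"])
    if record is None:
        return []   # call information was never cached or has been cleared
    return record["callers"] + record["callees"]


fault_info = {"service_id": "id-7f3a", "error": "connection reset"}
peers = find_peers(fault_info, call_cache)
```

Because the lookup is a single keyed access, the faulty device does not need to scan the call information of other services.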

With reference to the first aspect, in a possible implementation manner of the first aspect, the caching of the first call relationship includes: dividing the first call relationship into at least two states according to service life cycle, where first call relationships in different states correspond to services with different service life cycles; and caching the first call relationships of the different states separately.

In the scheme provided by the application, the cache space of the device is limited: when the space is insufficient to cache the call relationship of the next distributed service, the earliest cached call relationship is cleared. The call relationships of different distributed services are therefore divided into different states according to service life cycle and then cached separately. This prevents services with short life cycles from occupying most of the cache space; services with long life cycles and services with short life cycles each have their own cache space and do not affect each other.
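A minimal sketch of life-cycle-partitioned caching follows. It assumes exactly two states ("short" and "long"), a hypothetical threshold of 60 seconds that separates them, and an equal entry quota per state; the application only requires at least two states and does not specify these values.

```python
from collections import OrderedDict


class PartitionedCallCache:
    """Caches call relationships in separate partitions per life-cycle state."""

    def __init__(self, capacity_per_state, long_threshold_s=60.0):
        self.capacity = capacity_per_state
        self.long_threshold_s = long_threshold_s   # assumed boundary between states
        self.partitions = {"short": OrderedDict(), "long": OrderedDict()}

    def state_of(self, lifecycle_s):
        return "long" if lifecycle_s >= self.long_threshold_s else "short"

    def put(self, service_id, lifecycle_s, call_info):
        part = self.partitions[self.state_of(lifecycle_s)]
        if service_id not in part and len(part) >= self.capacity:
            part.popitem(last=False)   # clear the earliest entry of this state only
        part[service_id] = call_info

    def get(self, service_id):
        for part in self.partitions.values():
            if service_id in part:
                return part[service_id]
        return None


cache = PartitionedCallCache(capacity_per_state=2)
cache.put("video-call", 3600.0, ["tablet"])   # long-lived service
for i in range(3):                            # burst of short-lived services
    cache.put(f"tap-{i}", 1.0, ["watch"])
```

The burst of short-lived services evicts only `tap-0` from the "short" partition; the long-lived `video-call` entry is unaffected.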

With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: if the first device fails to send the first notification to the second device, the first device caches a first association failure event; the first association failure event indicates that sending of the first notification failed.

In the scheme provided by the application, if the faulty device does not successfully notify the peer device to report fault information, the information about the failed notification is not simply discarded. Instead, an association failure event is cached, which includes the information needed to notify the peer device, so that the faulty device can later notify the peer device again. In particular, in scenarios where a service fails because of external factors such as the network, this information acquisition mechanism can greatly improve the success rate of notifying the peer device.

With reference to the first aspect, in a possible implementation manner of the first aspect, the caching of the first association failure event includes: the first device determines whether there is enough cache space to cache the first association failure event; when there is enough cache space, the first device caches the first association failure event; when there is not enough cache space, the first device clears a second association failure event, where the second association failure event is the association failure event that has been cached for the longest time in the first device; if there is enough cache space after the second association failure event is cleared, the first device caches the first association failure event; if there is still not enough cache space after the second association failure event is cleared, the first device clears a third association failure event, where the third association failure event is the association failure event that has been cached for the longest time in the first device after the second association failure event is cleared.

In the scheme provided by the application, a quota and an aging mechanism are applied to the cached information about failed notifications: the association failure events that the faulty device can cache are limited, and when there is not enough space left to continue caching, the device clears the earliest cached association failure event. This prevents the storage space of the faulty device from being occupied by too much unimportant information and saves storage space.
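The quota and aging mechanism can be sketched as a bounded first-in, first-out cache. The sketch assumes that the quota is expressed in bytes and that each event has a known size; the class name `AssociationFailureCache` and the decision to reject an event larger than the whole quota are also assumptions.

```python
from collections import deque


class AssociationFailureCache:
    """Bounded cache that clears the oldest events until a new event fits."""

    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self.events = deque()   # oldest event on the left

    def add(self, event, size_bytes):
        if size_bytes > self.quota_bytes:
            return False        # assumption: an event that can never fit is dropped
        # Clear the longest-cached event repeatedly until there is enough space.
        while self.used_bytes + size_bytes > self.quota_bytes:
            _, old_size = self.events.popleft()
            self.used_bytes -= old_size
        self.events.append((event, size_bytes))
        self.used_bytes += size_bytes
        return True


failures = AssociationFailureCache(quota_bytes=100)
failures.add("notify tablet about svc-1", 40)
failures.add("notify watch about svc-1", 40)
failures.add("notify tv about svc-2", 90)     # clears both older events to fit
too_large = failures.add("oversized event", 200)
```

Adding the 90-byte event clears the oldest event first and, because space is still insufficient, then clears the next oldest, which mirrors the second and third association failure events described above.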

With reference to the first aspect, in a possible implementation manner of the first aspect, the method further includes: checking whether the first association failure event exists when the first device comes online again or processes a distributed service again; when the first association failure event exists, sending the first notification to the second device; and when the first notification is successfully sent, clearing the cached first association failure event.

In the scheme provided by the application, when the faulty device comes online again or processes a distributed service again, it checks whether an association failure event is cached; if so, the faulty device notifies the peer device again to report its fault-related information. This provides a reliable information acquisition mechanism that reduces the loss of locating information over unreliable communication links and greatly improves the association success rate.
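The re-notification step can be sketched as follows. The function name, the event fields `peer` and `service_id`, and the `send` callback that returns `True` on successful delivery are hypothetical; the trigger (the device coming online again or processing a distributed service again) is assumed to call this function.

```python
def resend_cached_notifications(pending, send):
    """Resend every cached notification; keep only the events that fail again."""
    still_pending = []
    for event in pending:
        delivered = send(event["peer"], event["service_id"])
        if not delivered:
            still_pending.append(event)   # keep the event for a later attempt
    return still_pending


reachable = {"tablet"}   # assumption: only the tablet is reachable right now
pending = [
    {"peer": "tablet", "service_id": "svc-1"},
    {"peer": "tv", "service_id": "svc-1"},
]
pending = resend_cached_notifications(
    pending, lambda peer, service_id: peer in reachable)
```

An event is cleared only after its notification has been delivered, so a peer that is still unreachable is retried at the next trigger.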

In a second aspect, a fault information association reporting method for a distributed system is provided, where the method includes: a second device caches a second call relationship, where the second call relationship includes device call information of one or more distributed services that are initiated by a user and that the second device participates in processing; the second device receives a first notification sent by a first device, where the first notification instructs the second device to report second fault information to a server; the second device reports the second fault information to the server; or the second device reports the second fault information to the server on the condition that the second device has not yet reported the second fault information; the second fault information includes fault information of the second device when processing a first distributed service.

In the scheme provided by the application, after receiving the notification from the faulty device, the peer device of the faulty device reports its own fault-related information to the server. Specifically, the peer device may report directly after receiving the notification, or may first check whether it has already reported the fault-related information and report only if it has not.
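Both reporting options of the second aspect can be sketched in a few lines. The class `SecondDevice`, the `check_first` flag that selects between the two options, and the placeholder text used when no local error was recorded are assumptions made for illustration.

```python
class SecondDevice:
    """Hypothetical model of a device that receives the first notification."""

    def __init__(self, name, local_faults):
        self.name = name
        self.local_faults = local_faults      # service_id -> fault-related info
        self.already_reported = set()

    def report(self, service_id, server_reports):
        info = self.local_faults.get(service_id, "no local error recorded")
        server_reports.append((self.name, service_id, info))
        self.already_reported.add(service_id)

    def on_notification(self, service_id, server_reports, check_first=True):
        """Report on notification; with check_first, skip a duplicate report."""
        if check_first and service_id in self.already_reported:
            return False
        self.report(service_id, server_reports)
        return True


server_reports = []
tablet = SecondDevice("tablet", {"svc-1": "peer timed out"})
first = tablet.on_notification("svc-1", server_reports)    # reports
second = tablet.on_notification("svc-1", server_reports)   # already reported
```

With the check enabled, a device that is notified by several peers for the same service uploads its information once.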

With reference to the second aspect, in a possible implementation manner of the second aspect, the method further includes: the second device looks up a third device in the second call relationship, where the third device includes a device that calls the second device or is called by the second device when the first distributed service is executed; the second device sends a second notification to the third device, where the second notification instructs the third device to report third fault information to the server; or the second device sends the second notification to the third device on the condition that the third device has not reported the third fault information; the third fault information includes fault information of the third device when processing the first distributed service.

In the scheme provided by the application, the faulty device not only reports its own fault information but also notifies the peer device to report its fault-related information. In this way the server can obtain complete fault-related information, which facilitates subsequent fault locating and analysis.

With reference to the second aspect, in a possible implementation manner of the second aspect, the looking up, by the second device, of the third device in the second call relationship includes: the second device looks up the third device in second device call information by using the first identifier; the second device call information is the device call information of the first distributed service in the second call relationship.

In the scheme provided by the application, the message sent by the faulty device to the peer device contains the identifier of the abnormal distributed service. The peer device can use this identifier to look up its own peer devices in its call relationship and then notify them to report fault-related information, which saves the time needed to find those devices.
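The hop-by-hop propagation from the first device to the second and third devices can be simulated with a breadth-first walk over each device's cached call relationship. This is a sketch only: the function name, the nested-dictionary layout and the device names are hypothetical, and the "has not yet reported" condition is modeled by skipping devices that are already in the result list.

```python
from collections import deque


def propagate_notifications(faulty_device, service_id, call_relations):
    """Return the devices in the order in which they report for one service."""
    reported = [faulty_device]
    queue = deque([faulty_device])
    while queue:
        device = queue.popleft()
        # Each device looks up its own peers using the identifier it received.
        peers = call_relations.get(device, {}).get(service_id, set())
        for peer in sorted(peers):
            if peer not in reported:   # notify only peers that have not reported
                reported.append(peer)
                queue.append(peer)
    return reported


# Assumed call chain for service "svc-1": phone <-> tablet <-> tv
call_relations = {
    "phone": {"svc-1": {"tablet"}},
    "tablet": {"svc-1": {"phone", "tv"}},
    "tv": {"svc-1": {"tablet"}},
}
order = propagate_notifications("tv", "svc-1", call_relations)
```

Each device consults only its own cached relations, yet every device on the call chain of the failing service ends up reporting, and no device is notified twice.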

With reference to the second aspect, in a possible implementation manner of the second aspect, the caching of the second call relationship includes: dividing the second call relationship into at least two states according to service life cycle, where second call relationships in different states correspond to services with different service life cycles; and caching the second call relationships of the different states separately.

In the scheme provided by the application, the call relationships of different distributed services are divided into different states according to service life cycle and then cached. With this caching mode, the device call relationships of services with different life cycles are stored separately and do not affect each other.

With reference to the second aspect, in a possible implementation manner of the second aspect, the method further includes: if the second device fails to send the second notification to the third device, the second device caches a fourth association failure event; the fourth association failure event indicates that sending of the second notification failed.

In the scheme provided by the application, a reliable information acquisition mechanism is established: when a device fails to notify the peer device to report fault-related information, the association failure event is cached so that the notification can be sent again later. This mechanism reduces the loss of locating information and greatly improves the success rate of notifying the peer device.

With reference to the second aspect, in a possible implementation manner of the second aspect, the caching of the fourth association failure event includes: the second device determines whether there is enough cache space to cache the fourth association failure event; when there is enough cache space, the second device caches the fourth association failure event; when there is not enough cache space, the second device clears a fifth association failure event, where the fifth association failure event is the association failure event that has been cached for the longest time in the second device; if there is enough cache space after the fifth association failure event is cleared, the second device caches the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, the second device clears a sixth association failure event, where the sixth association failure event is the association failure event that has been cached for the longest time in the second device after the fifth association failure event is cleared.

In the scheme provided by the application, when the device does not have enough cache space to store an association failure event, it clears the earliest stored association failure event. This prevents the memory from being occupied by excessive stale data and makes reasonable use of the device's storage space.

With reference to the second aspect, in a possible implementation manner of the second aspect, the method further includes: checking whether the fourth association failure event exists when the second device comes online again or processes a distributed service again; when the fourth association failure event exists, sending the second notification to the third device; and when the second notification is successfully sent, clearing the cached fourth association failure event.

In the scheme provided by the application, each time the device comes online again or processes a distributed service again, it checks whether an association failure event is cached; if so, the device notifies the peer device again to report its fault-related information. In this way all related devices are notified as far as possible, which reduces the loss of fault-related information and greatly improves the association success rate.

In a third aspect, a first device is provided, including: a first caching unit, configured to cache a first call relationship, where the first call relationship includes device call information of one or more distributed services that are initiated by a user and that the first device participates in processing; a first processing unit, configured to report first fault information to a server when the first device fails while processing a first distributed service, and to look up a second device in the first call relationship, where the second device includes a device that calls the first device or is called by the first device when the first distributed service is executed; and a first sending unit, configured to send a first notification to the second device, where the first notification instructs the second device to report second fault information to the server, or to send the first notification to the second device on the condition that the second device has not reported the second fault information; the second fault information includes fault information of the second device when processing the first distributed service.

With reference to the third aspect, in a possible implementation manner of the third aspect, when looking up the second device in the first call relationship, the first processing unit is specifically configured to: look up the second device in first device call information by using a first identifier; the first device call information is the device call information of the first distributed service in the first call relationship.

With reference to the third aspect, in a possible implementation manner of the third aspect, the first caching unit is specifically configured to: divide the first call relationship into at least two states according to service life cycle, where first call relationships in different states correspond to services with different service life cycles; and cache the first call relationships of the different states separately.

With reference to the third aspect, in a possible implementation manner of the third aspect, the first caching unit is further configured to: cache a first association failure event if the first sending unit fails to send the first notification to the second device; the first association failure event indicates that sending of the first notification failed.

With reference to the third aspect, in a possible implementation manner of the third aspect, when caching the first association failure event, the first caching unit is specifically configured to: determine whether there is enough cache space to cache the first association failure event; when there is enough cache space, cache the first association failure event; when there is not enough cache space, clear a second association failure event, where the second association failure event is the association failure event that has been cached for the longest time in the first device; if there is enough cache space after the second association failure event is cleared, cache the first association failure event; if there is still not enough cache space after the second association failure event is cleared, clear a third association failure event, where the third association failure event is the association failure event that has been cached for the longest time in the first device after the second association failure event is cleared.

With reference to the third aspect, in a possible implementation manner of the third aspect, the first processing unit is further configured to check whether the first association failure event exists when the first device comes online again or processes a distributed service again; the first sending unit is further configured to send the first notification to the second device when the first association failure event exists; and the first caching unit is further configured to clear the cached first association failure event after the first sending unit successfully sends the first notification.

In a fourth aspect, a second device is provided, including: a second caching unit, configured to cache a second call relationship, where the second call relationship includes device call information of one or more distributed services that are initiated by a user and that the second device participates in processing; a first receiving unit, configured to receive a first notification sent by a first device, where the first notification instructs the second device to report second fault information to a server; and a second processing unit, configured to report the second fault information to the server, or to report the second fault information to the server on the condition that the second device has not yet reported the second fault information; the second fault information includes fault information of the second device when processing a first distributed service.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the second processing unit is further configured to look up a third device in the second call relationship, where the third device includes a device that calls the second device or is called by the second device when the first distributed service is executed; the second device further includes a second sending unit, configured to send a second notification to the third device, where the second notification instructs the third device to report third fault information to the server, or to send the second notification to the third device on the condition that the third device has not reported the third fault information; the third fault information includes fault information of the third device when processing the first distributed service.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, when looking up the third device in the second call relationship, the second processing unit is specifically configured to: look up the third device in second device call information by using the first identifier; the second device call information is the device call information of the first distributed service in the second call relationship.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, when caching the second call relationship, the second caching unit is specifically configured to: divide the second call relationship into at least two states according to service life cycle, where second call relationships in different states correspond to services with different service life cycles; and cache the second call relationships of the different states separately.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the second caching unit is further configured to: cache a fourth association failure event if the second sending unit fails to send the second notification to the third device; the fourth association failure event indicates that sending of the second notification failed.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, when caching the fourth association failure event, the second caching unit is specifically configured to: determine whether there is enough cache space to cache the fourth association failure event; when there is enough cache space, cache the fourth association failure event; when there is not enough cache space, clear a fifth association failure event, where the fifth association failure event is the association failure event that has been cached for the longest time in the second device; if there is enough cache space after the fifth association failure event is cleared, cache the fourth association failure event; if there is still not enough cache space after the fifth association failure event is cleared, clear a sixth association failure event, where the sixth association failure event is the association failure event that has been cached for the longest time in the second device after the fifth association failure event is cleared.

With reference to the fourth aspect, in a possible implementation manner of the fourth aspect, the second processing unit is further configured to check whether the fourth association failure event exists when the second device comes online again or processes a distributed service again; the second sending unit is further configured to send the second notification to the third device when the fourth association failure event exists; and the second caching unit is further configured to clear the cached fourth association failure event after the second notification is successfully sent.

In a fifth aspect, a computing device is provided, where the computing device includes a processor and a memory, the memory is configured to store program code, and the processor is configured to perform the fault information association reporting method for a distributed system provided in the first aspect or in any implementation manner of the first aspect.

In a sixth aspect, a computing device is provided, where the computing device includes a processor and a memory, the memory is configured to store program code, and the processor is configured to perform the fault information association reporting method for a distributed system provided in the second aspect or in any implementation manner of the second aspect.

In a seventh aspect, a computer readable storage medium is provided, where the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the fault information association reporting method for a distributed system provided in the first aspect or in any implementation manner of the first aspect is implemented.

In an eighth aspect, a computer readable storage medium is provided, where the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the fault information association reporting method for a distributed system provided in the second aspect or in any implementation manner of the second aspect is implemented.

In a ninth aspect, the present application provides a computer program product, where the computer program product includes instructions, and when the computer program product is executed by a computer, the computer is enabled to perform the flow of the fault information association reporting method for a distributed system provided in the first aspect or in any implementation manner of the first aspect.

In a tenth aspect, the present application provides a computer program product, where the computer program product includes instructions, and when the computer program product is executed by a computer, the computer is enabled to perform the flow of the fault information association reporting method for a distributed system provided in the second aspect or in any implementation manner of the second aspect.

In an eleventh aspect, the present application provides a chip system including a processor configured to support a first device in implementing the functions referred to in the first aspect. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the data transmission device. The chip system may consist of chips, or may include chips and other discrete components.

In a twelfth aspect, the present application provides a chip system including a processor configured to support a second device in implementing the functions referred to in the second aspect. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the data transmission device. The chip system may consist of chips, or may include chips and other discrete components.

It will be appreciated that the first device provided by the third aspect, the computing device provided by the fifth aspect, the computer readable storage medium provided by the seventh aspect, the computer program product provided by the ninth aspect, and the chip system provided by the eleventh aspect are all configured to perform the fault information association reporting method for a distributed system provided by the first aspect. Therefore, for the advantages they achieve, refer to the advantages of the fault information association reporting method for a distributed system provided by the first aspect; they are not repeated herein.

It will be appreciated that the second device provided by the fourth aspect, the computing device provided by the sixth aspect, the computer readable storage medium provided by the eighth aspect, the computer program product provided by the tenth aspect, and the chip system provided by the twelfth aspect are all configured to perform the fault information association reporting method for a distributed system provided by the second aspect. Therefore, for the advantages they achieve, refer to the advantages of the fault information association reporting method for a distributed system provided by the second aspect; they are not repeated herein.

Drawings

FIG. 1 is a schematic diagram of a system architecture of a fault information association reporting method for a distributed system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a system architecture of another fault information association reporting method for a distributed system according to an embodiment of the present application;

FIG. 3 is a flow chart of a fault information association reporting method for a distributed system according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a call relationship cache according to an embodiment of the present application;

FIG. 5 is a schematic diagram of another call relationship cache according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a remote invocation provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of a mechanism for buffering association failure events according to an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of a buffering process of an association failure event according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a first device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a second device according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a computing device according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of another computing device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the present disclosure without creative effort fall within the scope of the present disclosure.

First, some of the expressions and related techniques referred to in the present application are explained in conjunction with the drawings to facilitate understanding by those skilled in the art.

The distributed system (Distributed System) is a software system built on top of a network. In a distributed system, a group of independent computers presents itself to the user as a unified whole, as if it were a single system. The system has a variety of general physical and logical resources and can allocate tasks dynamically; the scattered physical and logical resources exchange information through a computer network. Typically, a distributed system presents only one model or paradigm to the user. A layer of software middleware (Middleware) on top of the operating system is responsible for implementing this model.

Buried point is a term in the field of data collection, in particular user behavior data collection. It refers to the techniques, and their implementation, for capturing, processing, and transmitting specific user behaviors or events. In essence, a buried point monitors events while a software application is running and identifies and captures the events of interest when they occur. Buried points include code buried points, visual buried points, and no-buried-point collection. With code buried points, code is added so that data is reported when a user triggers the corresponding action. With visual buried points, the relevant events are first configured through visual interaction, and data is then collected. With no-buried-point collection, after the developer integrates the collection software development kit (Software Development Kit, SDK), the SDK directly captures and monitors all user behaviors in the application and reports them all, without the developer adding extra code. It should be noted that, in the embodiments of the present application, the buried points are unrelated to user behavior and relate only to faults in the service flow.

Remote procedure calls (Remote Procedure Call, RPC) exist primarily to make communication between the parts of a distributed system transparent. That is, with RPC the user does not need to know on which device the invoked service runs, and from the user's perspective invoking the remote service is as safe and reliable as invoking a local service. A remote invocation generally involves four procedures: Client Send (CS), Server Received (SR), Server Send (SS), and Client Received (CR).

The universally unique identifier (Universally Unique Identifier, UUID) is a standard used in software construction, standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). The most widely used UUID at present is Microsoft's Globally Unique Identifier (GUID). The UUID is designed so that every element in a distributed system can have unique identification information without that information being assigned by a central control terminal. In this way, each element can create a UUID that does not conflict with those of other elements, so that, for example, name duplication when creating a database need not be considered. A UUID is a number generated on one machine that is guaranteed to be unique across all machines in the same space and time. Typically, the platform provides an application programming interface (Application Programming Interface, API) for generating UUIDs. According to the standard established by the OSF, the calculation uses the Ethernet card address, nanosecond-level time, chip ID code, and many other possible numbers.

A UUID consists of the following parts:

(1) A time-dependent part. The first part of the UUID depends on the time: if another UUID is generated a few seconds after one is generated, the first part differs and the rest is the same.

(2) A clock sequence.

(3) Globally unique IEEE machine identification number.
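The three parts listed above correspond to fields of a version 1 UUID. As a minimal sketch, Python's standard `uuid` module exposes them. The node value and clock sequence below are arbitrary illustrative values fixed for reproducibility; they are assumptions and do not come from this application.

```python
import uuid

# Hypothetical 48-bit machine identifier and 14-bit clock sequence,
# fixed here only so that the example is reproducible.
NODE = 0x001122334455
CLOCK_SEQ = 0x1234

first = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)
second = uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ)

# (1) The time-dependent part differs between the two UUIDs.
print(first.time, second.time)
# (2) The clock sequence and (3) the machine identifier are the same.
print(first.clock_seq == second.clock_seq, first.node == second.node)
```

When `node` and `clock_seq` are omitted, the module chooses them itself, which is the usual way a device would generate its own identifier.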

The Event ID (Event ID) is a number that represents error information, that is, the cause that prevents the service from proceeding. Different Event IDs represent different error information, so the faulty module, the impact of the fault, and even the root cause can be roughly determined from the Event ID, which helps solve the problem.

Fault logs, i.e., error logs, are text files in which software records information about errors that occur during operation. Programmers, maintainers, and others can use the error log to debug and maintain the system.

An Open API (open application programming interface) is also known as an open platform. It is a common practice of service-type websites: the website's service provider encapsulates its website services into a series of application programming interfaces (Application Programming Interface, API) and opens them to third-party developers. This practice is called opening the website's API, and the APIs opened in this way are called Open APIs.

To facilitate understanding of the embodiments of the present application, a system architecture for reporting fault information of a distributed system, on which the embodiments are based, is described first. FIG. 1 shows a fault information reporting system architecture provided in an embodiment of the present application, which includes at least one server and at least one device. FIG. 1 takes a plurality of servers and a plurality of devices as an example. Devices A, B, C, D, and E are the devices involved in a distributed service flow initiated by a user. The 4 RPC calls in the distributed service occur between device A and device B, between device B and device C, between device C and device D, and between device D and device E, respectively. When one of devices A, B, C, D, and E fails while processing the distributed service and the distributed service becomes abnormal, the failed device reports fault information to the server.

It should be noted that the distributed service includes, but is not limited to, a wireless terminal request, a web page request, and an Open API request. Taking a web page request as an example, when a user clicks on a web page, the click may involve the web page calling other devices (such as a sub-server); exchanging messages with other applications, that is, a message service; querying and updating a distributed database; reading and writing a distributed cache; storing distributed objects; and the like.

In addition, the distributed service also includes operations in scenarios where an intelligent terminal cooperates with peripheral devices, such as cooperative file transfer. This means that the present application can be applied to near-field distributed cooperation scenarios such as large-screen projection, PC collaboration, PAD collaboration, and smart wearables.

In a distributed system, calls between devices are complicated, and one device is often called by a plurality of devices. FIG. 2 shows a fault information reporting system architecture according to another embodiment of the present application. Devices A, C, D, F, and H are the devices involved in a distributed service initiated by a user. The 6 RPC calls in the distributed service occur between device A and device C, between device A and device D, between device C and device F, between device C and device H, and between device D and device H. When one of devices A, C, D, F, and H fails while processing the distributed service and the distributed service becomes abnormal, the failed device reports fault information to the server.

It will be appreciated that the devices in FIG. 1 and FIG. 2 include, but are not limited to, servers, routers, switches, gateways, and terminal devices such as mobile phones, computers, and tablets.

In addition, the servers in FIG. 1 and FIG. 2 may be ordinary servers (also called physical servers), that is, real physical devices that can be placed in a machine room to operate and that have their own hard disks, bandwidth, and the like. The servers may also be servers obtained by virtualizing an ordinary server to varying degrees, in which case only part of the server may be a real physical device. The servers may also be cloud servers, a type of server that is highly distributed and highly virtualized. The computing resources of a cloud server are scheduled from a large number of physical servers that have been integrated and virtualized. In terms of node scale, the virtualization may span several, tens, or hundreds of physical servers, or a large cloud virtual resource pool built from thousands of pieces of physical hardware across data centers. A cloud server therefore supports elastic resource adjustment, meaning that resources such as CPU, memory, disk, and bandwidth can be freely increased or decreased, and it offers good scalability and high reliability.

It can be appreciated that the system architecture of the fault information reporting shown in fig. 1 and fig. 2 is only two exemplary implementations of the embodiments of the present application, and the system architecture of the fault information reporting in the embodiments of the present application includes, but is not limited to, the above structures.

In order to avoid collecting a large amount of normal business process data, reduce data storage and analysis cost, collect more information related to faults as much as possible, and improve fault positioning accuracy, the application provides a distributed system-oriented fault information association reporting method.

The following describes the fault information association reporting method for a distributed system in the embodiments of the present application in detail with reference to the flowchart shown in FIG. 3. As shown in FIG. 3, the method may include the following steps:

S310: the first device caches the first call relationship.

Specifically, when a user initiates one or more distributed services, each initiated distributed service flow may involve at least one device, and call relationships exist between the devices, as shown in Table 1 below:

TABLE 1

In addition, each device involved in processing the distributed service may cache its own call relationships. For example, the first device and the second device in the embodiments of the present application are devices involved in processing one or more distributed services initiated by a user. The first device caches the first call relationship and the second device caches the second call relationship, as shown in Table 2 below:

TABLE 2

Optionally, the first device may be any device in the fault information reporting system architectures for a distributed system shown in FIG. 1 and FIG. 2. For example, the first device may be device A in FIG. 1, in which case the second device is device B in FIG. 1; or the first device may be device C in FIG. 2, in which case the second devices are device A, device F, and device H in FIG. 2.

As shown in table 2, the first call relationship includes device call information of the distributed service that the first device participates in, where the device call information refers to call information related to the first device in the distributed service that the first device participates in, that is, information that the first device calls other devices and information that the first device is called by other devices.

It should be noted that, in an embodiment of the present application, the device call information of the one or more distributed services that the first device participates in processing may include the peer type of the first device, the service type of the distributed service, the communication information related to the first device in the distributed service, and a timestamp. The peer type includes the UUID of the peer device and the peer device type, where the peer device is a device that the first device calls, or a device that calls the first device, in the distributed service, and the peer device type includes, but is not limited to, a terminal, a server, a router, or a module in a software and/or hardware system. The service type includes, but is not limited to: basic telecommunication services such as fixed communication services, cellular mobile communication services, first-type satellite communication services, and first-type data communication services; basic telecommunication services such as trunking communication services, wireless paging services, second-type satellite communication services, second-type data communication services, network access services, national communication facility services, and network hosting services; and value-added telecommunication services such as first-type and second-type value-added telecommunication services. The communication information includes, but is not limited to, the communication information between the first device as the caller and the callee device, and the communication information between the first device as the callee and the caller device. The timestamp includes, but is not limited to, the calling time when the first device is the caller and the called time when the first device is the callee.
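As a minimal sketch of how one entry of the device call information described above might be represented, the following record groups the peer type, service type, communication information, and timestamps. The class name, field names, and sample values are hypothetical and are not defined by this application.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceCallInfo:
    """One cached entry of device call information (hypothetical layout)."""
    peer_uuid: str                    # UUID of the peer device
    peer_device_type: str             # e.g. "terminal", "server", "router"
    service_type: str                 # service type of the distributed service
    communication_info: str           # communication information with the peer
    calling_time: Optional[float] = None  # set when this device is the caller
    called_time: Optional[float] = None   # set when this device is the callee


# Example entry: this device called a peer server at t = 1000.0 seconds.
entry = DeviceCallInfo(
    peer_uuid="00000000-0000-0000-0000-00000000000b",
    peer_device_type="server",
    service_type="network access service",
    communication_info="rpc://device-b/example",
    calling_time=1000.0,
)
```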

In addition, the first calling relationship can be divided into at least two states according to the service life cycle, and then the first calling relationship in different states is cached separately, wherein the service life cycles corresponding to the first calling relationship in different states are different.

In one embodiment of the present application, the first call relationship may be divided into a cold state, a warm state, and a hot state according to a service life cycle, and then the first call relationship marked as the cold state, the warm state, and the hot state is separately cached. Specifically, when the service life cycle is smaller than or equal to a first threshold value, marking the first calling relation as a hot state; when the service life cycle is greater than a first threshold value and less than or equal to a second threshold value, marking the first calling relation as a warm state; and when the service life cycle is greater than a second threshold value, marking the first calling relation as a cold state. The first calling relations of the three states are cached in three areas respectively and are not influenced mutually.

It will be appreciated that the first threshold and the second threshold are set by a developer according to actual situations, which is not limited in this application.

For example, suppose the first threshold is set to 3 minutes and the second threshold is set to 4 hours. If a screen projection operation that projects the mobile phone screen onto the PC screen lasts 3 hours, the call relationship between the mobile phone and the PC is marked as the warm state; if a Bluetooth pairing lasts 5 hours before being unbound, the call relationship between the paired Bluetooth devices is marked as the cold state.
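The three-state marking can be sketched as follows, using the example thresholds of 3 minutes and 4 hours. The function and variable names are hypothetical, and the three dictionaries merely stand in for the three separate cache areas.

```python
HOT, WARM, COLD = "hot", "warm", "cold"

FIRST_THRESHOLD_S = 3 * 60      # example value from the text: 3 minutes
SECOND_THRESHOLD_S = 4 * 3600   # example value from the text: 4 hours


def classify(lifecycle_s: float) -> str:
    """Mark a call relationship according to its service life cycle."""
    if lifecycle_s <= FIRST_THRESHOLD_S:
        return HOT
    if lifecycle_s <= SECOND_THRESHOLD_S:
        return WARM
    return COLD


# Three separate cache areas, one per state, so that they do not affect
# one another.
cache_areas = {HOT: {}, WARM: {}, COLD: {}}


def cache_call_relationship(key: str, call_info: dict, lifecycle_s: float) -> str:
    """Store call_info in the area matching its state and return that state."""
    state = classify(lifecycle_s)
    cache_areas[state][key] = call_info
    return state


print(classify(3 * 3600))  # 3-hour screen projection
print(classify(5 * 3600))  # 5-hour Bluetooth pairing
```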

In another embodiment of the present application, the first call relationship may be divided into a first state, a second state, a third state, and a fourth state according to the service life cycle, and the first call relationships marked as the first state, the second state, the third state, and the fourth state are then cached separately. Specifically, when the service life cycle is less than or equal to a third threshold, the first call relationship is marked as the first state; when the service life cycle is greater than or equal to a fourth threshold and less than or equal to a fifth threshold, the first call relationship is marked as the second state; when the service life cycle is greater than or equal to a sixth threshold and less than or equal to a seventh threshold, the first call relationship is marked as the third state; and when the service life cycle is greater than or equal to an eighth threshold, the first call relationship is marked as the fourth state. The first call relationships of the four states are cached in four areas respectively and do not affect one another.

It is understood that the fourth threshold is greater than the third threshold, the fifth threshold is greater than or equal to the fourth threshold, the sixth threshold is greater than the fifth threshold, the seventh threshold is greater than or equal to the sixth threshold, and the eighth threshold is greater than the seventh threshold. In addition, the third threshold, the fourth threshold, the fifth threshold, the sixth threshold, the seventh threshold, and the eighth threshold may be set by a developer according to actual situations, and specific numerical values thereof are not limited in this application.

Illustratively, the fourth threshold and the fifth threshold are both set to 2 hours, and the device invocation relationship to which the service relates is marked as the second state if and only if the lifecycle of the service is 2 hours.
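The four-state marking and the ordering constraints on the six thresholds can be sketched as follows. The function name and the concrete threshold values other than the 2-hour example are assumptions. The text does not say how a life cycle that falls between two ranges is treated, so this sketch simply returns `None` in that case.

```python
from typing import Optional


def classify_four(lifecycle_s: float, t3: float, t4: float, t5: float,
                  t6: float, t7: float, t8: float) -> Optional[str]:
    """Return the state for a service life cycle, or None if it lies in a gap."""
    # Ordering required by the text: t4 > t3, t5 >= t4, t6 > t5, t7 >= t6, t8 > t7.
    if not (t3 < t4 <= t5 < t6 <= t7 < t8):
        raise ValueError("thresholds violate the required ordering")
    if lifecycle_s <= t3:
        return "first"
    if t4 <= lifecycle_s <= t5:
        return "second"
    if t6 <= lifecycle_s <= t7:
        return "third"
    if lifecycle_s >= t8:
        return "fourth"
    return None  # between two ranges: behavior not specified in the text


HOUR = 3600
# Fourth and fifth thresholds both 2 hours, as in the example; the other
# values are hypothetical.
thresholds = dict(t3=1 * HOUR, t4=2 * HOUR, t5=2 * HOUR,
                  t6=3 * HOUR, t7=4 * HOUR, t8=5 * HOUR)

print(classify_four(2 * HOUR, **thresholds))
```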

S320: when the first device fails while processing the first distributed service, the first device reports first fault information to the server.

Specifically, when the first device fails while processing the first distributed service, it generates first fault information and reports the first fault information to the server.

It should be noted that, in one embodiment of the present application, the first fault information includes a first fault event and a first fault log. The first fault event includes the device type of the faulty device (that is, the first device), an Event ID, the fault time, the faulty module, and the exception type. The device type includes, but is not limited to, a terminal, a server, a router, or a module in a software and/or hardware system. The faulty module is the module in the first device in which the fault occurred, and may be a hardware module or a software module. The exception type is the type of fault that occurred in the first device, for example no response, freezing, negotiation failure, or timeout. The first fault log is the log of the first device that is associated with the fault.
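As an illustration only, the first fault information could be assembled and serialized for reporting along the following lines. The class names, field names, sample values, and the choice of JSON as the wire format are all assumptions; the application does not specify a report format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FaultEvent:
    device_type: str      # type of the faulty device
    event_id: int         # Event ID representing the error information
    fault_time: str       # time at which the fault occurred
    fault_module: str     # hardware or software module that failed
    exception_type: str   # e.g. "no response", "timeout"


@dataclass
class FaultInfo:
    event: FaultEvent
    log: str              # log of the device associated with the fault


def to_report(info: FaultInfo) -> str:
    """Serialize fault information into a JSON string for reporting."""
    return json.dumps(asdict(info), sort_keys=True)


info = FaultInfo(
    event=FaultEvent("terminal", 1001, "2020-09-28T10:00:00", "bluetooth", "timeout"),
    log="example log line",
)
print(to_report(info))
```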

S330: the first device searches the second device from the first call relationship.

Specifically, as described above, the first device caches the first call relationship, and the first call relationship includes the device call information of the distributed services that the first device participates in processing. The second device can therefore be found through the first call relationship, where the second device includes a device that calls the first device, or a device that is called by the first device, when the first distributed service is executed.

In one embodiment of the present application, the objective of searching for the peer device may be achieved by assigning an identifier to each distributed service.

It should be noted that the identifier may be cached in the first call relationship in the first device. Specifically, FIG. 4 is a schematic diagram of caching the device call relationships of distributed services. As can be seen from FIG. 4, the identifier may be used as a key index for caching the device call information of a distributed service. That is, the identifier distinguishes different distributed services, and the device call information of different distributed services is cached separately. Consequently, to look up the device call information of a given distributed service, it suffices to look up the identifier of that distributed service. Optionally, FIG. 5 is a schematic diagram of another way of caching the device call relationships of distributed services. As can be seen from FIG. 5, the identifier of a distributed service may be stored together with its device call information, while the identifiers and device call information of different distributed services are stored separately in different cache spaces. Consequently, to look up the device call relationship of a given distributed service, the identifiers are checked one by one until the cache space holding the identifier of that distributed service is found, and the device call information of the distributed service is then obtained from that cache space.

When each distributed service is allocated an identifier, the first call relationship includes the identifiers of the one or more distributed services initiated by the user as well as the UUID of the peer device. When the first device fails while processing the first distributed service, it generates the first fault information. In this case, the first fault information includes not only the first fault event and the first fault log described above, but also the first identifier corresponding to the first distributed service. The first device can therefore find the first device call information in the first call relationship by looking up the first identifier, where the first device call information is the device call information of the first distributed service in the first call relationship. From it, the first device obtains the UUID and the communication information of its peer device, that is, the UUID of the second device and the communication information between the first device and the second device.
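The two caching layouts of FIG. 4 and FIG. 5 can be contrasted with a minimal sketch. The container choices (a dictionary for the key-indexed layout and a list of separate cache spaces for the other) and all names and sample values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

CallInfo = Dict[str, str]

# Layout in the manner of FIG. 4: the identifier is the key index.
indexed_cache: Dict[str, List[CallInfo]] = {}


def lookup_indexed(identifier: str) -> List[CallInfo]:
    """Direct lookup by identifier."""
    return indexed_cache.get(identifier, [])


# Layout in the manner of FIG. 5: each cache space stores one identifier
# together with its device call information.
cache_spaces: List[Tuple[str, List[CallInfo]]] = []


def lookup_scan(identifier: str) -> List[CallInfo]:
    """Check the identifiers one by one until the matching space is found."""
    for stored_identifier, call_info in cache_spaces:
        if stored_identifier == identifier:
            return call_info
    return []


# Populate both layouts with the same hypothetical data.
sample = [{"peer_uuid": "uuid-of-device-b", "role": "callee"}]
indexed_cache["trace-1"] = sample
cache_spaces.append(("trace-0", []))
cache_spaces.append(("trace-1", sample))
```

Both lookups return the same device call information; the key-indexed layout avoids the one-by-one check at the cost of maintaining an index.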

It should be noted that, in an embodiment of the present application, the identifier may be obtained by placing buried points in the middleware of the distributed system, and the call relationships of the devices may be obtained at the same time. This is because placing buried points in the middleware allows the service flow to be tracked while a distributed service is in progress, so that, in the devices participating in different distributed services, the log records related to different services can be associated with different identifiers. At the same time, device information such as the UUID, branch information, and device type of the devices involved in the distributed service, as well as the call relationships between the devices, can be obtained. It is to be appreciated that the middleware includes, but is not limited to, remote procedure call middleware, data access middleware, message middleware, transaction middleware, object middleware, and terminal emulation/screen conversion middleware.

For example, in the Hitrace system, a first-level switch is first turned on to start tracing. If a user initiates a distributed service, the Hitrace system tracks the distributed service and records the working information for processing it. In addition, the Hitrace system attaches an identifier, namely a Trace ID, to the corresponding log records of the devices involved. The Trace ID associates those log records with the distributed service, so that the records corresponding to the distributed service can be readily found in the logs of the devices involved according to the Trace ID. Then, a second-level switch is turned on to trace information between devices. At this point, the Hitrace system records the call information of the distributed service, that is, it obtains information about the devices at both ends of each call, so as to determine the call sequence of the devices. For example, FIG. 6 shows a remote call from device A to device B: device A (client) initiates a request (CS); device B (server) receives the request (SR); device B (server) then processes the request and sends the result to device A (client) (SS); and finally device A (client) obtains the return information from device B (server) (CR). It can be understood that when the second-level switch is turned on, the call sequence between device A and device B can be clearly known, and the log records of the four time nodes CS, SR, SS, and CR can be output.

Optionally, in addition to tracing information between devices by turning on a switch, in some embodiments of the present application, information between processes may be traced by turning on a switch, and information between threads may also be traced by turning on a switch.
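The four time nodes of a remote call can be illustrated with a small in-process simulation. This sketch does not use or reproduce the Hitrace interface; every function name here is hypothetical, and the "remote" call is an ordinary function call standing in for an RPC from device A to device B.

```python
import time
import uuid

# Shared log of (trace_id, device, time_node, timestamp) records.
log_records = []


def record(trace_id: str, device: str, time_node: str) -> None:
    """Append a log record tagged with the identifier of the service."""
    log_records.append((trace_id, device, time_node, time.monotonic()))


def device_b_handle(trace_id: str, request: str) -> str:
    record(trace_id, "B", "SR")   # server receives the request
    result = request.upper()      # placeholder for the real processing
    record(trace_id, "B", "SS")   # server sends the result
    return result


def device_a_call(request: str):
    trace_id = uuid.uuid4().hex   # identifier for this distributed service
    record(trace_id, "A", "CS")   # client sends the request
    result = device_b_handle(trace_id, request)
    record(trace_id, "A", "CR")   # client receives the return information
    return trace_id, result


trace_id, result = device_a_call("ping")
for rec in log_records:
    if rec[0] == trace_id:
        print(rec[1], rec[2])
```

Filtering the log by the identifier recovers the call sequence of the two devices for that service only.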

It will be appreciated that Trace ID mentioned in the examples is one expression of the above-described identifier, and that other ways of obtaining and expressing the identifier are possible, and this is not a limitation in the present application.

S340: the first device sends a first notification to the second device.

Specifically, after the first device obtains the UUID of the second device in the first call relationship, a first notification is sent to the second device according to the UUID of the second device, where the first notification includes the Event ID and the first identifier.

Optionally, the first device may check whether the first identifier and the UUID of the second device are already recorded in the first device. If they are, this indicates that the first device has already sent the first notification to the second device, that is, the second device has already reported the second fault information. In this case, the first device does not need to send the first notification to the second device again, that is, it does not need to transmit the UUID of the second device, the Event ID, and the first identifier to the second device.
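The optional check can be sketched as a record of (first identifier, peer UUID) pairs that is consulted before sending. The names below are hypothetical, and `send` is a caller-supplied function standing in for whatever transport the devices use.

```python
from typing import Callable, Set, Tuple

# Pairs of (identifier, peer UUID) for which a notification was already sent.
sent_notifications: Set[Tuple[str, str]] = set()


def notify_once(identifier: str, peer_uuid: str, event_id: int,
                send: Callable[[str, dict], None]) -> bool:
    """Send the notification unless this pair is already recorded.

    Returns True if a notification was sent, False if it was skipped.
    """
    key = (identifier, peer_uuid)
    if key in sent_notifications:
        return False  # the peer has already been notified for this service
    send(peer_uuid, {"event_id": event_id, "identifier": identifier})
    sent_notifications.add(key)
    return True


outbox = []
notify_once("trace-1", "uuid-b", 1001, lambda peer, msg: outbox.append((peer, msg)))
notify_once("trace-1", "uuid-b", 1001, lambda peer, msg: outbox.append((peer, msg)))
print(len(outbox))
```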

It should be noted that the first device may not be able to successfully send the first notification to the second device, for example, when a communication failure occurs between the first device and the second device, the first device may not be able to send a message to the second device, which means that the first device may not be able to notify the second device to report the second failure information at this time.

When the above situation occurs, as shown in FIG. 7, the first device caches a first association failure event. Before caching it, the first device determines whether there is enough cache space for the first association failure event. If there is enough cache space, the first association failure event is cached. If there is not enough cache space, a second association failure event is cleared. If, after the second association failure event is cleared, there is enough cache space, the first association failure event is cached; if there is still not enough cache space, a third association failure event is cleared.

It should be noted that the second association failure event is the association failure event that has been cached longest in the first device, and the third association failure event is the association failure event that has been cached longest in the first device after the second association failure event is cleared. In one embodiment of the present application, an association failure event includes the time at which the notification failed, an identifier, an Event ID, and the UUID of the peer device, where the peer device is a device called by the faulty device, or a device that calls the faulty device, in the distributed service in which the abnormality occurred. Accordingly, the first association failure event includes the time at which the first device failed to send the first notification, the Event ID corresponding to the fault that occurred in the first device, the first identifier, and the UUID of the second device.

In addition, the cache space may be a predetermined space, that is, a space separate from the system memory or storage.

It will be appreciated that the failure of the first device to send the first notification to the second device includes both the case in which the first device does not successfully send the first notification and the case in which the second device does not successfully receive the first notification.
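The caching mechanism of FIG. 7 amounts to clearing the longest-cached events until the new event fits. A minimal sketch follows, assuming that each event has a known size and that the cache space has a fixed capacity; the class name, the size accounting, and the handling of an event larger than the whole cache are assumptions not stated in the text.

```python
from collections import deque


class AssociationFailureCache:
    """Fixed-capacity cache that clears the longest-cached events first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.events = deque()  # (event, size) pairs, longest-cached first

    def put(self, event: dict, size: int) -> bool:
        if size > self.capacity:
            return False  # can never fit; this case is not covered by the text
        # Clear the longest-cached event repeatedly until there is room.
        while self.capacity - self.used < size:
            _, cleared_size = self.events.popleft()
            self.used -= cleared_size
        self.events.append((event, size))
        self.used += size
        return True


cache = AssociationFailureCache(capacity=10)
cache.put({"identifier": "e1"}, size=4)
cache.put({"identifier": "e2"}, size=4)
cache.put({"identifier": "e3"}, size=6)  # clears e1 to make room
print([event["identifier"] for event, _ in cache.events])
```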

In addition, as shown in fig. 8, when the first device comes online again or processes a distributed service again, the first device checks whether the first association failure event exists in the device. If it exists, the first device sends the first notification to the second device, that is, transmits the Event ID and the first identifier in the first association failure event to the second device, to notify the peer (the second device) to report fault information. If the first notification is sent successfully, the cached first association failure event is cleared.
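The resend-on-reconnect behaviour can be illustrated with a short sketch. The function `resend_pending` and the `send` callback are hypothetical names introduced only for this example; the transport that actually delivers the notification is not specified here and is assumed to return `True` on success.

```python
from typing import Callable, Dict, List


def resend_pending(
    pending: List[Dict],
    send: Callable[[str, int, str], bool],
) -> List[Dict]:
    """Retry every cached association failure event.

    `send(peer_uuid, event_id, service_id)` is an assumed callback that
    returns True when the notification was delivered. Events that are
    delivered are dropped from the cache; the rest stay cached.
    """
    still_pending = []
    for event in pending:
        delivered = send(event["peer_uuid"], event["event_id"], event["service_id"])
        if not delivered:
            still_pending.append(event)
    return still_pending
```

A device would call this when it comes online again or processes a distributed service again, and keep the returned list as its new cache content.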

S350: the second device receives a first notification sent by the first device.

Specifically, the second device receives the first notification sent by the first device, that is, the second device may receive the Event ID and the first identifier corresponding to the fault occurring in the first device, so the second device may learn, through the received first identifier, that the first distributed service is abnormal.

Before the second device receives the first notification sent by the first device, the second device also needs to cache a second call relationship. The second call relationship includes device call information of the distributed services that the second device participates in processing. The device call information is the call information related to the second device in those distributed services, that is, information about the second device calling other devices and information about the second device being called by other devices.

Similar to the first call relationship, the device call information of the distributed service that the second device participates in processing may include a peer type of the first device, a service type of the distributed service, communication information related to the second device in the distributed service, and a timestamp. For details, refer to the description above of the device call information of the one or more distributed services that the first device participates in processing; they are not repeated here.

In addition, the second call relationship can be divided into at least two states according to the service life cycle. Second call relationships in different states correspond to different service life cycles and are cached separately. Refer to the example of the first call relationship above; details are not repeated here.

S360: the second device reports the second fault information to the server.

Specifically, after the second device receives the first notification sent by the first device, the second device reports the second fault information to the server. It should be noted that, in an embodiment of the present application, the second fault information includes a first identifier, an Event ID, and a second fault log, where the second fault log refers to a log record related to a fault in a log of the second device.

It is understood that both the first device and the second device are involved in processing the first distributed service, and therefore both cache the device call information and the first identifier of the first distributed service.

Optionally, after the second device receives the first notification sent by the first device, the second device checks whether the first identifier already exists in the device. If it does, the second device has already reported the second fault information and does not need to report it to the server again, that is, it does not need to report the received Event ID, the first identifier, and the second fault log to the server.
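The optional duplicate check on the receiving side might look like the sketch below. `ReportingDevice` and its `report` callback are hypothetical; the sketch assumes that "the first identifier exists in the device" means the device keeps a record of the service identifiers for which it has already reported.

```python
class ReportingDevice:
    """Hypothetical receiver that reports at most once per service identifier."""

    def __init__(self, uuid, report):
        self.uuid = uuid
        self._report = report    # assumed callable that uploads a dict to the server
        self._reported = set()   # service identifiers already reported

    def on_notification(self, service_id, event_id, fault_log):
        """Handle a notification; return True if fault information was reported."""
        if service_id in self._reported:
            # Already reported for this distributed service: nothing to do.
            return False
        self._report({
            "service_id": service_id,  # the first identifier
            "event_id": event_id,      # Event ID received in the notification
            "fault_log": fault_log,    # fault-related records from the local log
        })
        self._reported.add(service_id)
        return True
```

Recording the identifier only after the upload call returns means a failed upload (one that raises) can be retried on a later notification; this ordering is a design choice of the sketch.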

In the system architecture for reporting fault information shown in fig. 2, one distributed service initiated by a user involves five devices: device A, device C, device D, device F, and device H. Call relationships exist between device A and device C, between device A and device D, between device C and device F, between device C and device H, and between device D and device H. In general, when device A fails, device A reports fault information and sends a message to device C and device D, which have call relationships with it, to notify device C and device D to report fault information. However, when device C holds information related to device A, multi-level association is required: after receiving the message sent by device A, device C sends a message to the device D, device F, and device H having a call relationship, and device D, after receiving the message sent by device A, sends a message to the device C and device H having a call relationship.
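The multi-level association can be simulated with a breadth-first traversal of the call relationships. This is only a sketch under simplifying assumptions: the function `propagate` is hypothetical, it uses only the five call relationships listed for fig. 2, and it tracks already-notified devices in one shared set, whereas in the described method each device checks its own records.

```python
from collections import deque


def propagate(call_edges, failed):
    """Return the order in which devices report, starting from the failed one."""
    # Build an undirected peer map: a call relationship links both directions.
    peers = {}
    for a, b in call_edges:
        peers.setdefault(a, []).append(b)
        peers.setdefault(b, []).append(a)

    reported = [failed]   # the failed device reports first
    seen = {failed}
    queue = deque([failed])
    while queue:
        device = queue.popleft()
        for peer in peers.get(device, []):
            if peer not in seen:   # skip devices that have already been notified
                seen.add(peer)
                reported.append(peer)
                queue.append(peer)
    return reported


edges = [("A", "C"), ("A", "D"), ("C", "F"), ("C", "H"), ("D", "H")]
print(propagate(edges, "A"))  # ['A', 'C', 'D', 'F', 'H']
```

Device H is reachable from both device C and device D, but it reports only once, which is the effect the duplicate checks are meant to achieve.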

It should be noted that, in one embodiment of the present application, the fault information reported to the server by the faulty device (the first device) differs from the fault information reported to the server by other devices (such as the second device) that have a call relationship with the faulty device. In this embodiment, when a device fails, a fault event is generated and the identifier of the corresponding service is displayed. The device then collects the fault log corresponding to the fault event and reports the fault event, the fault log, and the identifier to the server. The faulty device also sends a message to notify the devices that have a call relationship with it to report fault information; after receiving the message, those devices report their fault information to the server. It can be understood that the fault event may include a device type, a fault time, an Event ID, a fault module, and an exception type. The device type may be a terminal device such as a mobile phone, or another device such as a router. The fault time is the time at which the fault occurred. The Event ID indicates the fault type through the error information it represents; for example, the ID range of Windows Event IDs is 0-5073, and each Event ID represents different error information, so that when a fault occurs, the type of fault that occurred in the device can be determined from the Event ID. The fault module may be a hardware or software module inside the device, or a process or middleware. Exception types include, but are not limited to, no response, stuck, negotiation failure, and timeout.
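The difference between the two kinds of fault information can be made concrete with a small data model. All class, function, and field names below are hypothetical; only the listed contents (fault event, fault log, and identifier for the faulty device; first identifier, Event ID, and fault log for a notified device) come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class ExceptionType(Enum):
    NO_RESPONSE = "no response"
    STUCK = "stuck"
    NEGOTIATION_FAILURE = "negotiation failure"
    TIMEOUT = "timeout"


@dataclass
class FaultEvent:
    device_type: str            # e.g. "mobile phone" or "router"
    fault_time: float           # time at which the fault occurred
    event_id: int               # identifies the type of fault
    fault_module: str           # hardware/software module, process, or middleware
    exception_type: ExceptionType


def build_first_fault_info(event: FaultEvent, service_id: str, fault_log: str) -> dict:
    """Fault information reported by the faulty device: full event, log, identifier."""
    return {
        "service_id": service_id,
        "fault_log": fault_log,
        "fault_event": {
            "device_type": event.device_type,
            "fault_time": event.fault_time,
            "event_id": event.event_id,
            "fault_module": event.fault_module,
            "exception_type": event.exception_type.value,
        },
    }


def build_second_fault_info(event_id: int, service_id: str, fault_log: str) -> dict:
    """Fault information reported by a notified device: identifier, Event ID, own log."""
    return {"service_id": service_id, "event_id": event_id, "fault_log": fault_log}
```

Because both payloads carry the same service identifier and Event ID, a server could group them when analysing one fault; how the server does so is outside this sketch.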

S370: the second device searches the third device from the second call relationship.

Specifically, as described above, the second call relationship includes the identifiers of one or more distributed services initiated by the user, and further includes the UUIDs of peer devices. When the second device receives the first notification sent by the first device, the second device uses the first identifier in the received first notification to find second device call information in the second call relationship, where the second device call information is the device call information of the first distributed service in the second call relationship. From it, the second device finds the UUID and the communication information of its peer device, that is, the UUID of the third device and the communication information between the second device and the third device.
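The lookup in S370 amounts to indexing the cached call relationship by service identifier. The sketch below assumes a hypothetical layout in which the call relationship is a dictionary keyed by service identifier and each entry lists peers with a `uuid` and a `channel` field; the real cache layout is not specified at this level of detail. Excluding the device that sent the notification is also an assumption of the sketch.

```python
def find_peers(call_relationship, service_id, exclude_uuid=None):
    """Return the peers recorded for one distributed service.

    `call_relationship` maps a service identifier to its device call
    information; `exclude_uuid` lets the caller skip the notifying device.
    """
    info = call_relationship.get(service_id)
    if info is None:
        return []  # this device holds no call information for the service
    return [peer for peer in info["peers"] if peer["uuid"] != exclude_uuid]


second_call_relationship = {
    "svc-1": {
        "service_type": "screen-cast",
        "peers": [
            {"uuid": "uuid-first", "channel": "wifi"},
            {"uuid": "uuid-third", "channel": "bluetooth"},
        ],
    },
}
third = find_peers(second_call_relationship, "svc-1", exclude_uuid="uuid-first")
print(third)  # [{'uuid': 'uuid-third', 'channel': 'bluetooth'}]
```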

S380: the second device sends a second notification to the third device.

Specifically, after the second device obtains the UUID of the third device in the second call relationship, a second notification is sent to the third device according to the UUID of the third device, where the second notification includes the Event ID and the first identifier.

Optionally, the second device may check whether the first identifier and the UUID of the third device already exist in the device. If they do, the second device has already sent a second notification to the third device, that is, the third device has already reported the third fault information. In this case, the second device does not need to send the second notification to the third device again, that is, it does not need to transmit the UUID of the third device, the Event ID, and the first identifier to the third device.

Notably, the second device may fail to send the second notification to the third device. When this occurs, the second device determines whether there is enough cache space to cache a fourth association failure event. If there is, the fourth association failure event is cached. If there is not, the second device clears a fifth association failure event. If clearing the fifth association failure event frees enough cache space, the fourth association failure event is cached; if the cache space is still insufficient, the second device clears a sixth association failure event.

It should be noted that the fifth association failure event is the association failure event that has been cached for the longest time in the second device, and the sixth association failure event is the association failure event that has been cached for the longest time in the second device after the fifth association failure event is cleared.

It will be appreciated that a failure of the second device to send the second notification to the third device includes both the case in which the second device does not successfully send the second notification and the case in which the third device does not successfully receive it.

In addition, when the second device comes online again or processes a distributed service again, the second device checks whether the fourth association failure event exists in the device. If it exists, the second device sends the second notification to the third device, that is, transmits the Event ID and the first identifier in the fourth association failure event to the third device. If the sending succeeds, the cached fourth association failure event is cleared.

It can be understood that the manner in which the second device notifies the third device to report the third fault information is the same as the manner in which the first device notifies the second device to report the second fault information. For details, refer to the foregoing description of how the first device notifies the second device; they are not repeated here.

The method of the embodiments of the present application has been described in detail above. To better implement the foregoing aspects of the embodiments of the present application, related devices for implementing them in cooperation are provided below.

Fig. 9 is a schematic structural diagram of a first device provided in the present application, where the first device is configured to perform the fault information association reporting method for the distributed system described in fig. 3. The division of the functional units of the first device is not limited, and units in the first device may be added, removed, or combined as required. In addition, the operations and/or functions of each unit in the first device are respectively for implementing the corresponding flow of the method described in fig. 3, and are not described herein for brevity. Fig. 9 exemplarily provides a division of functional units:

The first device 900 includes a first caching unit 910, a first processing unit 920, and a first sending unit 930.

A first caching unit 910, configured to cache a first call relationship, where the first call relationship includes device call information of one or more user-initiated distributed services that the first device participates in processing.

A first processing unit 920, configured to report first fault information to the server when the first device fails while processing the first distributed service, and to search for a second device from the first call relationship; the second device includes a device that calls the first device, or a device that is called by the first device, when the first distributed service is executed.

A first sending unit 930, configured to send a first notification to the second device, where the first notification is used to instruct the second device to report second fault information to the server; or the first device sends the first notification to the second device under the condition that the second device does not report the second fault information; the second fault information includes fault information of the second device when processing the first distributed service.

The three units may transmit data to one another through a communication channel. It should be understood that the units included in the first device 900 may be software units, hardware units, or partly software units and partly hardware units.

Fig. 10 is a schematic structural diagram of a second device provided in the present application, where the second device is configured to execute the fault information association reporting method for the distributed system described in fig. 3. The division of the functional units of the second device is not limited, and units in the second device may be added, removed, or combined as required. In addition, the operations and/or functions of each unit in the second device are respectively for implementing the corresponding flow of the method described in fig. 3, and are not described herein for brevity. Fig. 10 exemplarily provides a division of functional units:

The second device 1000 includes a second caching unit 1010, a first receiving unit 1020, and a second processing unit 1030.

A second caching unit 1010, configured to cache a second call relationship, where the second call relationship includes device call information of one or more user-initiated distributed services that the second device participates in processing.

A first receiving unit 1020, configured to receive a first notification sent by a first device, where the first notification is used to instruct the second device to report second fault information to the server.

A second processing unit 1030, configured to report the second fault information to the server; or, under the condition that the second device has not reported the second fault information, to report the second fault information to the server; the second fault information includes fault information of the second device when processing the first distributed service.

In one possible implementation, the second device 1000 further includes: a second sending unit 1040, configured to send a second notification to the third device, where the second notification is used to instruct the third device to report third fault information to the server; or, if the third device does not report the third fault information, the second device sends the second notification to the third device; the third fault information includes fault information when the third device processes the first distributed service.

The four units may transmit data to one another through a communication channel. It should be understood that the units included in the second device 1000 may be software units, hardware units, or partly software units and partly hardware units.

Referring to fig. 11, fig. 11 is a schematic structural diagram of a computing device according to an embodiment of the present application. As shown in fig. 11, the computing device 1100 includes: processor 1110, communication interface 1120, and memory 1130, where processor 1110, communication interface 1120, and memory 1130 are interconnected by internal bus 1140.

The computing device 1100 may be the first device 900 of fig. 9, with the functions performed by the first device 900 of fig. 9 being performed in effect by the processor 1110 of the first device 900.

The processor 1110 may consist of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and hardware chips. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The communication interface 1120 is used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), a core network, or a wireless local area network (WLAN).

Bus 1140 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean that there is only one bus or one type of bus.

Memory 1130 may include volatile memory, such as random access memory (RAM); the memory 1130 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD); memory 1130 may also include combinations of the above. The memory 1130 is configured to store program code for executing the foregoing method embodiment of fault information association reporting for the distributed system. In one embodiment, the memory 1130 may also cache other data, with execution controlled by the processor 1110 to implement the functional units shown in the first device 900, or to implement the method steps in the method embodiment shown in fig. 3 in which the first device 900 is the execution body. The method comprises the following steps:

processor 1110 controls memory 1130 to cache a first call relationship comprising device call information of one or more user-initiated distributed services that the first device 900 participates in processing;

when the first device 900 fails while processing the first distributed service, the processor 1110 controls the communication interface 1120 to report the first fault information to the server;

Processor 1110 looks up the second device from the first call relationship; the second device includes a device that invokes the first device 900 or is invoked by the first device 900 when executing the first distributed service;

processor 1110 controls communication interface 1120 to send a first notification to the second device, where the first notification is used to instruct the second device to report second fault information to the server; or, in the case that the second device does not report the second fault information, the first device 900 sends the first notification to the second device; the second fault information includes fault information of the second device when processing the first distributed service.

In one implementation, the processor 1110 searches for the second device from the first call relationship, including: processor 1110 looks up the second device from the first device call information by the first identification; the first device call information is device call information of the first distributed service in the first call relationship.

In one implementation, the processor 1110 divides the first call relationship into at least two states according to a service life cycle; first call relationships in different states correspond to different service life cycles; processor 1110 separately caches the first call relationships of the different states in memory 1130.

In one implementation, if the first device 900 fails to send the first notification to the second device, the memory 1130 caches a first association failure event; the first association failure event indicates that sending the first notification failed.

In one implementation, processor 1110 determines whether there is sufficient cache space to cache the first association failure event; when there is currently sufficient cache space, processor 1110 controls memory 1130 to cache the first association failure event; when there is currently insufficient cache space, processor 1110 clears the second association failure event; the second association failure event is the association failure event that has been cached for the longest time in the first device; if there is enough cache space to cache the first association failure event after the second association failure event is cleared, the processor 1110 controls the memory 1130 to cache the first association failure event; if there is still insufficient cache space after the second association failure event is cleared, processor 1110 clears a third association failure event; the third association failure event is the association failure event that has been cached for the longest time in the first device after the second association failure event is cleared.

In one implementation, when the first device 900 comes online again or processes a distributed service again, the processor 1110 checks whether the first association failure event exists; when the first association failure event exists, the processor 1110 controls the communication interface 1120 to send the first notification to the second device; upon successful transmission of the first notification, processor 1110 clears the cached first association failure event.

Referring to fig. 12, fig. 12 is a schematic structural diagram of a computing device according to an embodiment of the present application. As shown in fig. 12, the computing device 1200 includes: processor 1210, communication interface 1220 and memory 1230, said processor 1210, communication interface 1220 and memory 1230 being interconnected by an internal bus 1240.

The computing device 1200 may be the second device 1000 of fig. 10, with the functions performed by the second device 1000 of fig. 10 being performed in effect by the processor 1210 of the second device 1000.

The processor 1210 may consist of one or more general-purpose processors, such as a central processing unit (CPU), or a combination of a CPU and hardware chips. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The communication interface 1220 is used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), a core network, or a wireless local area network (WLAN).

Bus 1240 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 1240 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 12, but this does not mean that there is only one bus or one type of bus.

Memory 1230 may include volatile memory, such as random access memory (RAM); the memory 1230 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD); the memory 1230 may also include combinations of the above. The memory 1230 is configured to store program code for executing the foregoing method embodiment of fault information association reporting for the distributed system. In one embodiment, the memory 1230 may also cache other data, with execution controlled by the processor 1210 to implement the functional units shown in the second device 1000, or to implement the method steps in the method embodiment shown in fig. 3 in which the second device 1000 is the execution body. The method comprises the following steps:

Processor 1210 controls memory 1230 to cache a second call relationship comprising device call information of one or more user-initiated distributed services that the second device 1000 participates in processing;

the processor 1210 in the second device 1000 controls the communication interface 1220 to receive a first notification sent by the first device, where the first notification is used to instruct the second device to report second fault information to the server;

processor 1210 reports the second failure information to the server; or, in the case that the second device 1000 does not report the second fault information, the second device 1000 reports the second fault information to the server; the second failure information includes failure information when the second device 1000 processes the first distributed service.

In one implementation, the processor 1210 searches for a third device from the second call relationship; the third device includes a device that invokes the second device 1000 or is invoked by the second device 1000 when executing the first distributed service; processor 1210 controls communication interface 1220 to send a second notification to the third device, the second notification being for instructing the third device to report third failure information to the server; or, in the case that the third device does not report the third fault information, the second device 1000 sends the second notification to the third device; the third fault information includes fault information when the third device processes the first distributed service.

In one implementation, the processor 1210 searches for the third device from the second device call information by the first identifier; the second device call information is the device call information of the first distributed service in the second call relationship.

In one implementation, the processor 1210 divides the second call relationship into at least two states according to a service life cycle; second call relationships in different states correspond to different service life cycles; processor 1210 separately caches the second call relationships of the different states in memory 1230.

In one implementation, if the processor 1210 fails to send the second notification to the third device by controlling the communication interface 1220, the processor 1210 controls the memory 1230 to cache the fourth association failure event; the fourth association failure event indicates that sending the second notification failed.

In one implementation, processor 1210 determines whether there is sufficient cache space to cache the fourth association failure event; when there is currently sufficient cache space, processor 1210 controls memory 1230 to cache the fourth association failure event; when there is currently insufficient cache space, processor 1210 clears the fifth association failure event; the fifth association failure event is the association failure event that has been cached for the longest time in the second device; if there is enough cache space to cache the fourth association failure event after the fifth association failure event is cleared, processor 1210 controls memory 1230 to cache the fourth association failure event; if there is still insufficient cache space after the fifth association failure event is cleared, processor 1210 clears the sixth association failure event; the sixth association failure event is the association failure event that has been cached for the longest time in the second device after the fifth association failure event is cleared.

In one implementation, when the second device 1000 comes online again or processes a distributed service again, the processor 1210 checks whether the fourth association failure event exists; when the fourth association failure event exists, processor 1210 controls the communication interface 1220 to send the second notification to the third device; upon successful transmission of the second notification, processor 1210 clears the cached fourth association failure event.

The present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, can implement some or all of the steps of any one of the above-described method embodiments, and implement the functions of any one of the functional units described in fig. 9.

The present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, can implement some or all of the steps of any one of the above-described method embodiments, and implement the functions of any one of the functional units described in fig. 10.

Embodiments of the present application also provide a computer program product which, when run on a computer or processor, causes the computer or processor to perform one or more of the steps of any of the methods described above in which the first device 900 is the execution body. If the constituent modules of the above-mentioned device are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.

Embodiments of the present application also provide a computer program product which, when run on a computer or processor, causes the computer or processor to perform one or more of the steps of any of the methods described above in which the second device 1000 is the execution body. If the constituent modules of the above-mentioned device are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium.

The embodiment of the present application further provides a chip system, where the chip system includes a processor configured to support the first device 900 in implementing one or more steps of any one of the above methods in which the first device 900 is the execution body. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the first device 900. The chip system may consist of chips, or may include chips and other discrete devices.

The embodiment of the present application further provides a chip system, where the chip system includes a processor configured to support the second device 1000 in implementing one or more steps of any one of the above methods in which the second device 1000 is the execution body. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the second device 1000. The chip system may consist of chips, or may include chips and other discrete devices.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

It should be understood that the terms first, second, third, and fourth, and the various numerical designations referred to herein, are used merely for descriptive convenience and are not intended to limit the scope of the present application.

It should be understood that the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

It should also be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The steps in the methods of the embodiments of the present application may be reordered, combined, or deleted according to actual needs.

The modules in the devices of the embodiments of the present application may be combined, divided, or deleted according to actual needs.

The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A fault information association reporting method for a distributed system, characterized by comprising the following steps:

a first device caches a first call relationship, wherein the first call relationship comprises device call information of one or more distributed services that are initiated by a user and that the first device participates in processing;

when a fault occurs while the first device processes a first distributed service, the first device reports only first fault information to a server, wherein the first fault information comprises fault information of the first device when processing the first distributed service;

the first device searches for a second device in the first call relationship, wherein the second device comprises a device that calls the first device or a device that is called by the first device when the first distributed service is executed;

the first device sends a first notification to the second device, wherein the first notification instructs the second device to report only second fault information to the server; or the first device sends the first notification to the second device in a case where the second device has not reported the second fault information; the second fault information comprises fault information of the second device when processing the first distributed service, and does not comprise the fault information of the first device when processing the first distributed service.

2. The method of claim 1, wherein the first device searching for the second device in the first call relationship comprises:

the first device searches for the second device in first device call information by using a first identifier; the first device call information is the device call information of the first distributed service in the first call relationship.

3. The method of claim 1 or 2, wherein caching the first call relationship comprises:

dividing the first call relationship into at least two states according to service life cycle, wherein first call relationships in different states correspond to services with different service life cycles;

and caching the first call relationships in the different states separately.

4. The method of claim 1, wherein the method further comprises:

if the first device fails to send the first notification to the second device, the first device caches a first association failure event, wherein the first association failure event indicates that sending the first notification failed.

5. The method of claim 4, wherein caching the first association failure event comprises:

the first device determines whether there is sufficient cache space to cache the first association failure event;

when there is currently sufficient cache space to cache the first association failure event, caching the first association failure event;

when there is currently insufficient cache space to cache the first association failure event, clearing a second association failure event, wherein the second association failure event is the association failure event that has been cached longest in the first device;

if, after the second association failure event is cleared, there is sufficient cache space to cache the first association failure event, caching the first association failure event;

if, after the second association failure event is cleared, there is still insufficient cache space to cache the first association failure event, clearing a third association failure event, wherein the third association failure event is the association failure event that has been cached longest in the first device after the second association failure event is cleared.

6. The method of claim 4 or 5, wherein the method further comprises:

when the first device comes online again or processes a distributed service again, checking whether the first association failure event exists;

when the first association failure event exists, sending the first notification to the second device;

and when the first notification is successfully sent, clearing the cached first association failure event.

7. A fault information association reporting method for a distributed system, characterized by comprising the following steps:

a second device caches a second call relationship, wherein the second call relationship comprises device call information of one or more distributed services that are initiated by a user and that the second device participates in processing;

the second device receives a first notification sent by a first device, wherein the first notification instructs the second device to report only second fault information to a server;

the second device reports only the second fault information to the server; or, in a case where the second device has not reported the second fault information, the second device reports only the second fault information to the server; the second fault information comprises fault information of the second device when processing a first distributed service, and does not comprise fault information of the first device when processing the first distributed service.

8. The method of claim 7, wherein the method further comprises:

the second device searches for a third device in the second call relationship, wherein the third device comprises a device that calls the second device or a device that is called by the second device when the first distributed service is executed;

the second device sends a second notification to the third device, wherein the second notification instructs the third device to report third fault information to the server; or the second device sends the second notification to the third device in a case where the third device has not reported the third fault information; the third fault information comprises fault information of the third device when processing the first distributed service.

9. The method of claim 8, wherein the second device searching for the third device in the second call relationship comprises:

the second device searches for the third device in second device call information by using a first identifier; the second device call information is the device call information of the first distributed service in the second call relationship.

10. The method of any of claims 7-9, wherein caching the second call relationship comprises:

dividing the second call relationship into at least two states according to service life cycle, wherein second call relationships in different states correspond to services with different service life cycles;

and caching the second call relationships in the different states separately.

11. The method of claim 8, wherein the method further comprises:

if the second device fails to send the second notification to the third device, the second device caches a fourth association failure event, wherein the fourth association failure event indicates that sending the second notification failed.

12. The method of claim 11, wherein the caching of the fourth association failure event comprises:

the second device determines whether there is sufficient cache space to cache the fourth association failure event;

when there is currently sufficient cache space to cache the fourth association failure event, caching the fourth association failure event;

when there is currently insufficient cache space to cache the fourth association failure event, clearing a fifth association failure event, wherein the fifth association failure event is the association failure event that has been cached longest in the second device;

if, after the fifth association failure event is cleared, there is sufficient cache space to cache the fourth association failure event, caching the fourth association failure event;

if, after the fifth association failure event is cleared, there is still insufficient cache space to cache the fourth association failure event, clearing a sixth association failure event, wherein the sixth association failure event is the association failure event that has been cached longest in the second device after the fifth association failure event is cleared.

13. The method of claim 11 or 12, wherein the method further comprises:

when the second device comes online again or processes a distributed service again, checking whether the fourth association failure event exists;

when the fourth association failure event exists, sending the second notification to the third device;

and when the second notification is successfully sent, clearing the cached fourth association failure event.

14. A first device, comprising:

a first caching unit, configured to cache a first call relationship, wherein the first call relationship comprises device call information of one or more distributed services that are initiated by a user and that the first device participates in processing;

a first processing unit, configured to: report only first fault information to a server when a fault occurs while the first device processes a first distributed service, wherein the first fault information comprises fault information of the first device when processing the first distributed service; and search for a second device in the first call relationship, wherein the second device comprises a device that calls the first device or a device that is called by the first device when the first distributed service is executed;

a first sending unit, configured to send a first notification to the second device, wherein the first notification instructs the second device to report only second fault information to the server; or to send the first notification to the second device in a case where the second device has not reported the second fault information; the second fault information comprises fault information of the second device when processing the first distributed service, and does not comprise the fault information of the first device when processing the first distributed service.

15. The device of claim 14, wherein, when searching for the second device in the first call relationship, the first processing unit is specifically configured to:

search for the second device in first device call information by using a first identifier; the first device call information is the device call information of the first distributed service in the first call relationship.

16. The device according to claim 14 or 15, wherein the first caching unit is specifically configured to:

cache the first call relationships of different states separately.

17. The device of claim 14, wherein the first caching unit is further configured to:

cache a first association failure event if the first sending unit fails to send the first notification to the second device, wherein the first association failure event indicates that sending the first notification failed.

18. The device of claim 17, wherein, when caching the first association failure event, the first caching unit is specifically configured to:

determine whether there is sufficient cache space to cache the first association failure event;

19. The device according to claim 17 or 18, wherein the first processing unit is further configured to check whether the first association failure event exists when the first device comes online again or processes a distributed service again; the first sending unit is further configured to send the first notification to the second device when the first association failure event exists; and the first caching unit is further configured to clear the cached first association failure event after the first sending unit successfully sends the first notification.

20. A second device, comprising:

a second caching unit, configured to cache a second call relationship, wherein the second call relationship comprises device call information of one or more distributed services that are initiated by a user and that the second device participates in processing;

a first receiving unit, configured to receive a first notification sent by a first device, wherein the first notification instructs the second device to report only second fault information to a server;

a second processing unit, configured to report only the second fault information to the server; or, in a case where the second device has not reported the second fault information, to report only the second fault information to the server; the second fault information comprises fault information of the second device when processing a first distributed service, and does not comprise fault information of the first device when processing the first distributed service.

21. The device of claim 20, wherein the second processing unit is further configured to search for a third device in the second call relationship, the third device comprising a device that calls the second device or a device that is called by the second device when the first distributed service is executed;

the second device further comprises a second sending unit, configured to send a second notification to the third device, wherein the second notification instructs the third device to report third fault information to the server; or to send the second notification to the third device in a case where the third device has not reported the third fault information; the third fault information comprises fault information of the third device when processing the first distributed service.

22. The device of claim 21, wherein, when searching for the third device in the second call relationship, the second processing unit is specifically configured to:

search for the third device in second device call information by using a first identifier; the second device call information is the device call information of the first distributed service in the second call relationship.

23. The device of any one of claims 20 to 22, wherein, when caching the second call relationship, the second caching unit is specifically configured to:

cache the second call relationships of different states separately.

24. The device of claim 21, wherein the second caching unit is further configured to:

cache a fourth association failure event if the second sending unit fails to send the second notification to the third device, wherein the fourth association failure event indicates that sending the second notification failed.

25. The device of claim 24, wherein, when caching the fourth association failure event, the second caching unit is specifically configured to:

determine whether there is sufficient cache space to cache the fourth association failure event;

26. The device of claim 24 or 25, wherein the second processing unit is further configured to check whether the fourth association failure event exists when the second device comes online again or processes a distributed service again;

the second sending unit is further configured to send the second notification to the third device when the fourth association failure event exists; and the cached fourth association failure event is cleared after the second notification is successfully sent.

27. A computing device comprising a memory and a processor that executes computer instructions stored in the memory, causing the computing device to perform the method of any one of claims 1-13.

28. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-13.

29. A chip system, comprising at least one processor, a memory, and an interface circuit, wherein the memory, the interface circuit, and the at least one processor are interconnected by lines, and the memory has instructions stored therein; the instructions, when executed by the at least one processor, implement the method of any of claims 1-13.
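
For illustration only, the flow recited in method claims 1 and 4 to 8 might be modelled by the following minimal Python sketch. It is not the claimed implementation. The class `Device`, the fields `server`, `network`, and `max_failure_events`, and all method names are hypothetical. The sketch also makes simplifying assumptions that the claims do not state: cache space is counted in number of events, transport is a direct in-memory call, a peer counts as unreachable when it is offline, and the check for "has not yet reported" is made by the receiving device.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical device model; every name here is an assumption."""
    name: str
    server: list            # stands in for the fault-report server
    network: dict           # device name -> Device; stands in for transport
    online: bool = True
    max_failure_events: int = 2   # assumed capacity, counted in events
    call_relation: dict = field(default_factory=dict)  # service id -> peer names
    reported: set = field(default_factory=set)         # services already reported
    failure_events: OrderedDict = field(default_factory=OrderedDict)

    def cache_call(self, service_id, peer):
        """Cache a device that called, or was called by, this device."""
        self.call_relation.setdefault(service_id, set()).add(peer)

    def report_fault(self, service_id):
        """Report only this device's own information, at most once per service."""
        if service_id in self.reported:
            return False
        self.reported.add(service_id)
        self.server.append((self.name, service_id))
        return True

    def on_fault(self, service_id):
        """First device: report locally, then notify associated devices."""
        self.report_fault(service_id)
        self._notify_peers(service_id, exclude=None)

    def on_notification(self, service_id, sender):
        """Second device: report if not yet reported, then propagate onward."""
        if self.report_fault(service_id):
            self._notify_peers(service_id, exclude=sender)

    def _notify_peers(self, service_id, exclude):
        for peer in sorted(self.call_relation.get(service_id, ())):
            if peer != exclude and not self._deliver(service_id, peer):
                self._cache_failure_event((service_id, peer))

    def _deliver(self, service_id, peer):
        """Return True if the notification reached the peer."""
        target = self.network.get(peer)
        if target is None or not target.online:
            return False
        target.on_notification(service_id, self.name)
        return True

    def _cache_failure_event(self, event):
        """Evict the longest-cached events until the new event fits."""
        if event in self.failure_events:
            return
        while self.failure_events and len(self.failure_events) >= self.max_failure_events:
            self.failure_events.popitem(last=False)  # oldest entry first
        self.failure_events[event] = True

    def retry_failed_notifications(self):
        """On coming online again: resend, and clear events that succeed."""
        for service_id, peer in list(self.failure_events):
            if self._deliver(service_id, peer):
                del self.failure_events[(service_id, peer)]
```

In this sketch a device would call `on_fault` when it fails while processing a service, and `retry_failed_notifications` when it comes online again or processes a distributed service again. An `OrderedDict` is used so that the longest-cached association failure event is always the first entry, which keeps the eviction step a single `popitem(last=False)` call.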