CN113656207A - Fault processing method, device, electronic equipment and medium - Google Patents

Fault processing method, device, electronic equipment and medium

Info

Publication number
CN113656207A
Authority
CN
China
Prior art keywords
instance
data
determining
fault
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110937028.8A
Other languages
Chinese (zh)
Other versions
CN113656207B (en)
Inventor
楚振江
李建均
宋晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110937028.8A priority Critical patent/CN113656207B/en
Publication of CN113656207A publication Critical patent/CN113656207A/en
Application granted
Publication of CN113656207B publication Critical patent/CN113656207B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0751 Error or fault detection not based on redundancy
    • G06F 11/0793 Remedial or corrective actions

Abstract

The disclosure provides a fault handling method, a fault handling apparatus, an electronic device, and a medium, relating to the field of computer technology and, in particular, to service operation and maintenance and cloud platforms. A fault handling method includes: obtaining operational data, the operational data including data from a name service of a first module; determining a fault handling event in response to determining, based on the operational data, that a first instance of the first module has failed; and executing the fault handling event on the first instance.

Description

Fault processing method, device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to service operation and maintenance and cloud platforms, and more particularly to a fault handling method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
In a computer system, when a fault or abnormality occurs in an online service instance and is not identified and handled in time, the service stability of the entire system is affected. Reducing labor cost while improving automatic fault handling capability has therefore become an urgent technical problem in computer operation and maintenance.
It is desirable to have a method of identifying and handling faulty instances when they occur in a computer system.
Disclosure of Invention
The present disclosure provides a fault handling method, apparatus, electronic device, computer-readable storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a fault handling method including: obtaining operational data, the operational data including data from a name service of a first module; determining a failure handling event in response to determining that a first instance of the first module failed based on the operational data; and executing the fault handling event on the first instance.
According to another aspect of the present disclosure, there is provided a fault handling apparatus including: an operation data acquisition unit configured to acquire operation data including data of a name service from the first module; a fault event determination unit to determine a fault handling event in response to determining that a first instance of the first module is faulty based on the operational data; and a fault handling unit for executing the fault handling event on the first instance.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a fault handling method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a fault handling method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements a fault handling method according to embodiments of the present disclosure.
According to one or more embodiments of the present disclosure, faults in system services may be accurately and efficiently determined and handled.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a fault handling method according to an embodiment of the present disclosure;
FIGS. 3A-3C illustrate various scenario diagrams for fault identification and handling according to embodiments of the present disclosure;
FIG. 4 shows a data flow diagram during fault handling according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a fault handling apparatus according to an embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the fault handling methods according to embodiments of the present disclosure to be performed.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to control the operation of computer services, view fault identification and processing results, and so forth. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various Mobile operating systems, such as Microsoft Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host employing artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of traditional physical hosts and Virtual Private Server (VPS) services, namely high management difficulty and weak service scalability.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The data store 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In certain embodiments, the data store used by the server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to the command.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
A fault handling method 200 according to an embodiment of the present disclosure is described below with reference to fig. 2.
At step 210, operational data is obtained, the operational data including data from a name service of a first module.
At step 220, a fault handling event is determined in response to determining, based on the operational data, that a first instance of the first module has failed.
At step 230, the fault handling event is executed on the first instance.
By employing the name service as a fault monitoring indicator in accordance with the method 200 of the present disclosure, faults in instances can be effectively identified.
A name service is a service function that describes the instances of a module: it provides service-name-based resolution for addressing between large-scale computer programs and can return state data for the instances under the module. Using the name service as a monitoring indicator for instances better reflects the real invocation situation within a module of the computer system. For example, collecting simple metric data can run into problems where the collected data does not belong to a live instance or the instance has already been shut down, whereas the name service enumerates the instances themselves and thus more truthfully reflects how the instances in the current system are operating. In addition, there are cases where simple metrics such as CPU usage show no abnormality while the name service state is abnormal. Therefore, according to embodiments of the present disclosure, using the data of the name service enables more flexible, comprehensive, and efficient fault identification and fault handling.
A complex system, such as one built from computer cloud service modules, may contain hundreds or thousands of modules. In particular, as internet cloud technology develops, more and more cloud services adopt microservice architectures, which decouple and split module functions and thus lead to a huge number of modules and a complex system architecture. Each module consists of computer program instances with the same functionality, so in complex systems the number of instances can reach the order of hundreds of thousands to millions. Failures and anomalies of online service instances therefore occur frequently and, if not identified and handled in a timely manner, affect the service stability of the overall system. Reducing labor cost while improving automatic fault handling capability has become an urgent technical problem in computer operation and maintenance.
According to embodiments of the present disclosure, service faults of a computer program can be sensed and service state indicators collected; the fault scenario is analyzed by processing the indicator data; the identified service fault is matched against decision schemes; and once a handling scheme for the service fault is determined, it is handed to a service fault handling stage to complete automatic processing of the fault in the computer program service. Such an automated solution that requires no manual confirmation may be referred to as "self-healing". The method according to embodiments of the present disclosure can be applied to fault identification and fault handling in various service systems, including but not limited to search engine systems, information-flow recommendation systems, and cloud service systems. It is to be understood that the present disclosure is not limited thereto.
As shown in FIG. 3A, the instances of a computer system include functioning instances and malfunctioning instances. Where fault identification is imperfect, the instances covered by fault identification may include abnormal instances (correctly identified, corresponding to area S1) and normal instances (misidentified, corresponding to area S2), while the instances not covered by fault identification may include abnormal instances (unidentified, corresponding to area S3) and normal instances (corresponding to area S4). Scenario coverage is the ratio of the number of instances covered by fault identification to the total number of instances in the system. Self-healing recall is the proportion of abnormal instances identified relative to all abnormal instances in the system. Self-healing accuracy is the proportion of abnormal instances among all instances covered by fault identification. As shown in FIG. 3A, the scenario coverage rate is (S1+S2)/(S1+S2+S3+S4), the self-healing recall rate is S1/(S1+S3), and the self-healing accuracy rate is S1/(S1+S2). In service fault identification and handling, it is desirable to improve scenario coverage, self-healing recall, and self-healing accuracy, so that abnormal instances are identified more accurately and processed automatically and the overall functions of the computer service remain normal.
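Restated in display form (the same arithmetic as above, with S1 through S4 denoting the four areas of FIG. 3A):

```latex
\text{scene coverage} = \frac{S_1 + S_2}{S_1 + S_2 + S_3 + S_4}, \qquad
\text{self-healing recall} = \frac{S_1}{S_1 + S_3}, \qquad
\text{self-healing accuracy} = \frac{S_1}{S_1 + S_2}
```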
An example relationship between modules and instances in a computer system is described with reference to FIG. 3B. FIG. 3B shows a first module 321 named program service A, with two instances 3211 and 3212, and a second module 322 named program service B, with three instances 3221, 3222, and 3223. It is understood that the numbers of modules and instances are examples. The above data may be obtained by querying the name service function of each module. Thereafter, the per-instance data obtained from the name service may be analyzed to determine whether an instance is abnormal and, if so, the instance is processed according to the fault handling decision.
Variations of fault handling methods according to some embodiments of the present disclosure are described below.
According to some embodiments, determining that the first instance failed based on the operational data includes determining a duration of a current exception state in response to the data from the name service of the first module indicating that a latest state of the first instance is an exception state. Subsequently, a determination is made that the first instance failed in response to the duration exceeding a first threshold.
In general, the data of the name service indicates a single state of an instance, such as its current state or its state at the time of a given timestamp. For example, the data of the name service may include a service state or state code defined in the name service, such as the values 0, 1, 2, etc. identifying different states; it is understood that the disclosure is not limited thereto. Additionally monitoring how long a state has persisted therefore further improves the accuracy of fault identification. For example, the historical state of the instance may be queried, and the instance considered failed if the current abnormal state persists beyond a time threshold (e.g., 3 hours) or, alternatively, beyond a count threshold (e.g., the state is abnormal in three consecutive checks taken at one-hour intervals). It is to be understood that these thresholds are merely examples; one skilled in the art may set larger or smaller thresholds, or different thresholds for different modules or instances (e.g., depending on the criticality or importance of the module).
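As an illustration only, and not the patent's reference implementation, the duration rule above might be sketched as follows; the state codes, the three-hour threshold, and all function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

ABNORMAL_STATES = {1, 2}                  # hypothetical state codes meaning "abnormal"
DURATION_THRESHOLD = timedelta(hours=3)   # example first threshold from the text

def instance_failed_by_duration(state_history):
    """state_history: list of (timestamp, state_code) samples, newest last."""
    if not state_history:
        return False
    latest_time, latest_state = state_history[-1]
    if latest_state not in ABNORMAL_STATES:
        return False
    # Walk backwards to find when the current abnormal run started.
    start = latest_time
    for ts, state in reversed(state_history):
        if state not in ABNORMAL_STATES:
            break
        start = ts
    return latest_time - start >= DURATION_THRESHOLD

# Example: abnormal for four consecutive hourly samples -> treated as failed.
now = datetime(2021, 8, 16, 12, 0)
history = [(now - timedelta(hours=h), 1 if h <= 3 else 0) for h in range(6, -1, -1)]
print(instance_failed_by_duration(history))  # True
```

A count-based variant would simply count consecutive abnormal samples instead of measuring elapsed time.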
According to some embodiments, determining that the first instance failed based on the operational data includes determining that the first instance failed in response to determining that the first module further includes at least one second instance and that the data from the name service of the first module indicates that a difference in the state value of the first instance and the state value of the at least one second instance is greater than a second threshold.
Abnormality can be detected by computing, among the instances under the same service name, whether some instance behaves differently from most of the others. Continuing the example of FIG. 3B, assume that after reading the name service data, three instances under service name B are obtained, and analysis of each instance's metric data shows that the metric data of instance 3222 differs greatly from that of instances 3221 and 3223. In that case, instance 3222 of the second module may be considered an abnormal instance. Where a state code is used to characterize instance state, the second threshold may be a state-code interval, or may be set according to a gap between abnormality levels. For example, if a value of 0 indicates a normal state, 1 a slightly abnormal state, 2 a moderately abnormal state, and so on, the second threshold may be a state-value difference of 2 or more. It is to be understood that the present disclosure is not limited thereto. According to this embodiment, a fault identification and handling strategy based on an instance's peer view can be realized by comparing all the instances under the same name service against one another.
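A minimal sketch of the peer-comparison rule, assuming numeric state codes where larger values indicate more severe abnormality; comparing against the median peer state is one possible reading of the text, and the threshold of 2 mirrors the example above.

```python
def failed_by_peer_deviation(states, second_threshold=2):
    """states: instance id -> state value from the name service of one module."""
    if len(states) < 2:                 # the rule needs at least one second instance
        return []
    values = sorted(states.values())
    median = values[len(values) // 2]   # treat the median as the "typical" peer state
    return [iid for iid, v in states.items() if abs(v - median) > second_threshold]

# Example modelled on FIG. 3B: instance 3222 deviates from its peers 3221 and 3223.
print(failed_by_peer_deviation({"3221": 0, "3222": 3, "3223": 0}))  # ['3222']
```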
According to some embodiments, the operational data further includes error reporting data from a third instance of the second module, and wherein determining that the first instance failed based on the operational data includes: in response to determining from the network topology data that the error data from the third instance indicates an anomaly associated with the first instance, determining that the first instance failed.
Diagnosing through the error-reporting data of other instances allows instance faults to be identified from the perspective of data flow and service effect, giving a more comprehensive judgment of service faults. When, according to the network topology data, the error information of another instance points to the current instance, a fault of the current instance can be effectively identified.
For example, referring to FIG. 3C, an error is received from the first instance 3311 of the first module 331 indicating an exception to the data or instructions sent to the third instance 3323 of the downstream second module 332. Thus, instance 3323 may be considered to have failed. As another example, instance 3311 may also be considered to have failed if an error is received from the third instance 3323 of the second module 332 for instance 3311. It is to be understood that this is by way of example only and that the disclosure is not limited thereto.
The network topology data may be obtained using service mesh technology. With the development of cloud-native technology and microservices, service meshes are deployed in computer cloud program services to manage the invocation relationships between service modules in a complex system. For example, a service mesh agent may be deployed in each program service, a service mesh control center may be deployed externally, and the agent of each program service may connect to and be routed by the control center to realize invocation and access relationships between modules or instances. It will be appreciated that other deployment and topology-data acquisition approaches are possible. With continued reference to FIG. 3C, three modules 331, 332, and 333 in a computer system are shown from a service topology perspective, where a first module 331 has two instances 3311 and 3312, a second module 332 has three instances 3321, 3322, and 3323, and a third module 333 has two instances 3331 and 3332. The first module may invoke the second module, and the second module may invoke the third module. It is to be understood that the numbers of modules and instances, the upstream-downstream relationships between modules, and the connections between instances are examples, and the disclosure is not limited thereto.
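The topology-based attribution could look roughly like the following sketch; the call-graph representation and all names are assumptions rather than the patent's data model.

```python
# call_topology: caller instance -> set of downstream instances it invokes,
# e.g. obtained from a service-mesh control plane (assumed representation).
call_topology = {
    "3311": {"3321", "3322", "3323"},
    "3312": {"3321", "3322", "3323"},
}

def implicated_instance(error_report, topology):
    """error_report: dict naming the reporting instance and the callee it complains about.
    Returns the callee if the topology confirms that the reporter actually calls it."""
    reporter, callee = error_report["reporter"], error_report["callee"]
    if callee in topology.get(reporter, set()):
        return callee    # the error points at a real downstream dependency
    return None          # otherwise the report is ignored

print(implicated_instance({"reporter": "3311", "callee": "3323"}, call_topology))  # 3323
```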
According to some embodiments, the operational data includes error reporting data from a plurality of instances, and determining that the first instance failed based on the operational data includes: determining that the first instance failed in response to an exception associated with the first instance satisfying at least one of the following conditions: majority upstream reporting and voting decision.
In the case of a complex network topology, where the first instance has multiple associated instances, majority upstream reporting and voting decisions may be employed. Observing and discovering anomalous instances from upstream is often simpler and more intuitive. For example, if two of three upstream instances report an exception associated with a particular instance (the "first instance"), the first instance is considered to have failed. Continuing the example of FIG. 3C, two instances of the first module 331 access three instances of the downstream second module 332; when instance 3323 of the second module enters an abnormal condition and cannot properly serve the functions of the computer program, instance 3311 of the first module counts and reports the abnormal condition. Similarly, instance 3312 of the first module reports an abnormal condition when invoking instance 3323. Exception feedback about a particular instance of the downstream module is thus obtained from multiple instances of the upstream module, and based on majority upstream reporting and a voting decision, instance 3323 of the second module is identified as an abnormal instance. As another example, if only one instance (e.g., instance 3312) reported an error for instance 3323 while other callers of instance 3323 (not shown in the figures) did not, instance 3323 may be considered not to have failed based on a voting decision or a majority-reporting decision. It is to be understood that the numbers of modules and instances and the "majority" conditions above are merely examples, and the present disclosure is not limited thereto.
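For illustration only, the majority-upstream-reporting rule might be implemented roughly as follows; treating "more than half of the known upstream callers" as the majority is one plausible reading of the text, and all names are assumed.

```python
from collections import Counter

def failed_by_upstream_vote(error_reports, upstream_of):
    """error_reports: list of (reporter, callee) pairs.
    upstream_of: callee -> set of all upstream instances that call it.
    A callee is flagged when a majority of its upstream callers report it."""
    votes = Counter(callee for _, callee in set(error_reports))
    failed = []
    for callee, count in votes.items():
        upstream = upstream_of.get(callee, set())
        if upstream and count > len(upstream) / 2:
            failed.append(callee)
    return failed

# Example from FIG. 3C: both upstream instances 3311 and 3312 report instance 3323.
reports = [("3311", "3323"), ("3312", "3323")]
upstream = {"3323": {"3311", "3312"}}
print(failed_by_upstream_vote(reports, upstream))  # ['3323']
```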
According to some embodiments, the operational data further includes metric data for the first instance, and determining that the first instance failed based on the operational data includes: determining that the first instance failed in response to the metric data for the first instance being above a third threshold. The metric data may include, but is not limited to, a duration (e.g., a timeout indicates an exception), resource occupancy (e.g., CPU or memory exceeding a predetermined occupancy rate indicates an exception), a return value (e.g., a return value indicating an error, or an expected return value not being obtained, indicates an exception), and so on; it is to be understood that the disclosure is not so limited. Determining that an instance has failed based on operational data that includes metric data may also involve the instance's historical metric data, module-instance configuration, semantic indicators, and the like. Identifying abnormal conditions of service instances through metric data allows them to be identified and handled quickly. Combining the metric data with the data of the name service (and, optionally, error-reporting data of other instances, etc.) allows faults to be identified and handled more comprehensively and accurately.
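To illustrate the metric-based check, the sketch below combines several of the metrics mentioned above (latency, resource occupancy, return value); every threshold value shown is an assumption rather than a value from the disclosure.

```python
# Illustrative thresholds only; real values would come from operations experience.
THRESHOLDS = {"latency_ms": 2000, "cpu_percent": 90, "mem_percent": 95}

def failed_by_metrics(metrics):
    """metrics: dict of metric name -> observed value for one instance.
    The instance is suspected when any metric exceeds its threshold or the
    call returned an error (non-zero return code)."""
    if metrics.get("return_code", 0) != 0:
        return True
    return any(metrics.get(name, 0) > limit for name, limit in THRESHOLDS.items())

print(failed_by_metrics({"latency_ms": 3500, "cpu_percent": 40, "return_code": 0}))  # True
print(failed_by_metrics({"latency_ms": 120, "cpu_percent": 35, "return_code": 0}))   # False
```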
The above-mentioned operation data (for example, the operation data including one or more of index data, name service data, and topology and error reporting data) may be collected and processed in real time, or may be collected and then temporarily stored in a memory, and then summarized, counted, and analyzed to determine a fault instance and a fault handling event if a trigger condition is met (for example, periodically triggered). The periodic triggering may include periodically (e.g., every ten minutes, hour, day, etc.) pulling data associated with modules in the system (e.g., the first and second modules described above) and analyzing the data to determine whether there are anomalies associated with the modules and their instances.
For example, the data may be periodically read to determine whether there is an abnormality reported by the upstream and downstream modules of the modules. Such a policy may be applicable, for example, when the data reported each time is not too critical and no real-time exception mechanism is triggered.
As a further example, the triggering condition may include receiving relatively severe anomalous data (e.g., more than 50%), and thereby triggering a query and aggregation of the data for that or related modules to determine whether the fault identification policy is satisfied.
According to some embodiments, obtaining operational data includes collecting first operational data from at least one module including the first module; obtaining previously acquired second operational data in response to the value of the first operational data satisfying a preliminary fault determination condition, the value of the second operational data not satisfying the preliminary fault determination condition; and using the first operational data and the second operational data as the operational data. In other words, data that the preliminary determination does not deem a significant violation (e.g., CPU usage only slightly above 10%, or the name service showing no abnormality or only a low-level abnormality) may be stored. Later, when operational data is collected that significantly exceeds a threshold or is clearly anomalous (e.g., a very significant abnormal name-service state value, CPU occupancy exceeding 40% or 50%, multiple associated instances reporting errors, etc.), that data can be pushed directly to a fault identification unit for determination and event identification, and the previously stored non-violating data read back for aggregation and analysis.
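The two-stage collection described above (buffer mildly abnormal data, push clearly abnormal data together with the buffered context) might be sketched as follows; the severity test, field names, and in-memory buffer are simplified stand-ins for the database and the fault identification unit.

```python
class OperationalDataCollector:
    """Toy two-stage collector: mild samples are buffered, severe samples trigger
    analysis together with everything buffered so far (simplified illustration)."""

    def __init__(self, severe_cpu=40, analyzer=print):
        self.buffer = []            # stands in for the temporary store / database
        self.severe_cpu = severe_cpu
        self.analyzer = analyzer    # stands in for the fault identification unit

    def collect(self, sample):
        severe = (sample.get("cpu_percent", 0) >= self.severe_cpu
                  or sample.get("name_service_abnormal"))
        if severe:
            # Severe data: push immediately, together with previously buffered data.
            self.analyzer({"trigger": sample, "context": list(self.buffer)})
            self.buffer.clear()
        else:
            self.buffer.append(sample)   # mild data: keep for later aggregation

collector = OperationalDataCollector()
collector.collect({"instance": "3321", "cpu_percent": 12})   # buffered
collector.collect({"instance": "3321", "cpu_percent": 55})   # triggers analysis
```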
According to some embodiments, the fault handling event comprises at least one of: repairing the instance, migrating the instance directly, and stopping service and then migrating the instance. Faults can thus be handled automatically, reducing manual intervention and labor cost and improving fault handling efficiency. Compared with direct migration, stopping service before migrating offers stronger timeliness. It will be appreciated that the above are merely examples, and that the disclosure is not limited thereto. Optionally, a safety-lock mechanism may be applied to the instances being processed: an upper limit may be placed on the number and proportion of abnormal instances handled automatically, and when the limit is exceeded a higher-level safety policy may be enabled or a warning output to indicate that human intervention is required.
According to some embodiments, prior to executing the fault handling event on the first instance, the method further comprises: determining another fault handling event for another instance different from the first instance; and, in response to determining that the fault handling event and the other fault handling event satisfy a rollup condition, merging the fault handling event and the other fault handling event. The rollup condition may be the same fault type (and thus the same or similar fault handling events); it may also relate to instance location, module, and so on. For example, computed fault instances and fault events may be sent to an aggregation component, also called a self-healing event bus, where they are summarized, categorized, and distributed. The whole handling system may be deployed on a computing cloud different from the one being managed, and different fault handlers may be responsible for different modules or instances. As an example of assigning fault handlers by fault-handling-event type, the handlers may include: a direct instance-migration handler; a stop-service-then-migrate handler; a machine failover handler; and a multi-instance exception handler. It is to be understood that the present disclosure is not limited thereto.
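An illustrative sketch of event aggregation and dispatch on such a self-healing event bus follows; the event fields, handler names, and the safety cap value are hypothetical.

```python
from collections import defaultdict

HANDLERS = {
    "migrate": lambda events: print("migrating", [e["instance"] for e in events]),
    "stop_then_migrate": lambda events: print("stop+migrate", [e["instance"] for e in events]),
    "repair": lambda events: print("repairing", [e["instance"] for e in events]),
}

def dispatch(fault_events, max_auto=10):
    """Group events that share a fault type (one possible rollup condition),
    apply a simple safety cap, and hand each group to the matching handler."""
    if len(fault_events) > max_auto:
        print("too many fault events; requesting human intervention")
        return
    grouped = defaultdict(list)
    for event in fault_events:
        grouped[event["type"]].append(event)
    for fault_type, events in grouped.items():
        handler = HANDLERS.get(fault_type)
        if handler:
            handler(events)

dispatch([
    {"instance": "3222", "type": "migrate"},
    {"instance": "3323", "type": "stop_then_migrate"},
])
```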
The fault handling scheme according to one or more embodiments of the present disclosure may be combined with the program itself, for example, by embedding an operation data monitoring module or the like in the program module, and thus may not depend on a cloud service provider or a third party, thereby being capable of improving the efficiency of fault identification at least by reducing the transmission amount of data and the scale of the fault identification system.
The data flow during fault identification and handling according to an embodiment of the present disclosure is described below with reference to fig. 4. As shown in fig. 4, the system may include a data acquisition layer 410, an event decision layer 420, and an execution processing layer 430. It is to be understood that the partitioning herein is merely a functional example, and does not require such partitioning physically. As one example, two or all of the layers may be physically located together, or each of the layers may be physically separated into more functional modules and communicatively coupled to implement the functionality described herein. It will be understood by those skilled in the art that the present disclosure is not limited thereto.
The data collection layer 410 may be used to obtain various types of operational data from computer program services, including but not limited to metric data 411, topology data 412, name service data 413, and the like. As previously described, the data collection layer may simply aggregate and process the operational data (e.g., via index collection aggregation 414) before storing in database 415; alternatively, the data collection layer may push data directly to the event decision layer 420 according to a simple policy, priority, or the like, e.g., send operational data of high failure probability to the event decision layer 420, while storing indicators of low failure probability in the database 415 for later use.
The event decision layer 420 may further analyze the acquired operational data and compare it against predetermined fault handling policies, for example using fault handling policy identification 421, to determine a fault handling event. For example, metric data may be compared with predetermined thresholds. For topology data, majority upstream reporting, voting decisions, and the like may be adopted. For name service data, the current state value may be checked periodically, compared with the state values of other instances under the same service name, and/or an instance may be judged abnormal based on the duration of the abnormal state. Other data-processing methods and abnormality decision strategies have been described above and are not repeated here. The event decision layer may classify and identify fault scenarios according to the operational data collected in real time, the operational data pre-stored in the database, optional external signals 422, and other comprehensive information, and determine corresponding decision events according to the identified fault scenarios. An external signal may be an external configuration signal, a manual setting, and the like. The fault handling decision event may then be sent to the execution processing layer 430.
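The event decision layer's mapping from identified fault scenarios to decision events could be sketched as below; the scenario fields and the scenario-to-event mapping are purely illustrative.

```python
def decide_event(instance_id, signals):
    """Event-decision-layer sketch: map identified fault signals to a handling event.
    The mapping below is illustrative, not the patent's decision table."""
    if signals.get("name_service_abnormal_duration_h", 0) >= 3:
        return {"instance": instance_id, "type": "stop_then_migrate"}
    if signals.get("upstream_majority_reports"):
        return {"instance": instance_id, "type": "migrate"}
    if signals.get("metric_over_threshold"):
        return {"instance": instance_id, "type": "repair"}
    return None   # no fault scenario identified

print(decide_event("3323", {"upstream_majority_reports": True}))
```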
The execution processing layer 430 may collect the decision events of multiple instances (for example, using the event collection center 431), distribute them to different fault handlers 432, and send instructions so that the corresponding fault handler, working with the cloud service where the computer program is located, performs operations such as repair, replacement, and stopping service. In this way, fault identification and handling of the computer program, and thus self-healing of service abnormalities, can be realized. According to one or more embodiments, using operational data beyond the traditional dimensions as fault indicators enables fault identification and service self-healing at the level of the computer program itself, of the upstream and downstream of the service network topology, and of the name-service peer group, improving the stability and reliability of computer service functions and thus the robustness of a complex computer system.
A fault handling apparatus 500 according to an example embodiment of the present disclosure is described below in conjunction with fig. 5. The apparatus 500 comprises: an operational data acquisition unit 510, a fault event determination unit 520 and a fault handling unit 530. The operation data obtaining unit 510 may be configured to obtain operation data including data of a name service from the first module. The fault event determination unit 520 may be configured to determine a fault handling event in response to determining that the first instance of the first module is faulty based on the operational data. The failure handling unit 530 may be configured to perform a failure handling event on the first instance.
According to some embodiments, the failure event determination unit 520 may include a unit to determine a duration of a current exception state in response to the data from the name service of the first module indicating that the latest state of the first instance is an exception state; and means for determining that the first instance failed in response to the duration exceeding a first threshold.
According to some embodiments, failure event determination unit 520 may include a unit to determine that the first instance failed in response to determining that the first module further includes at least one second instance and that the data from the name service of the first module indicates that a difference in the state value of the first instance and the state value of the at least one second instance is greater than a second threshold.
According to some embodiments, the operational data may further comprise error reporting data from a third instance of the second module, and wherein the fault event determination unit may comprise a unit for determining that the first instance failed in response to determining from the network topology data that the error reporting data from the third instance indicates an anomaly associated with the first instance.
According to some embodiments, the operational data may include error reporting data from a plurality of instances, and the failure event determination unit may include a unit for determining that the first instance failed in response to an anomaly associated with the first instance satisfying at least one of the following conditions: majority upstream reporting and voting decision.
According to some embodiments, the operational data may further include indicator data of the first instance, and wherein the failure event determination unit 520 may include a unit for determining that the first instance failed in response to the indicator data of the first instance being above a third threshold.
According to some embodiments, the operational data acquisition unit 510 may include a unit for collecting first operational data from at least one module including a first module; means for acquiring previously acquired second operating data in response to the value of the first operating data satisfying the preliminary fault determination condition, the value of the second operating data not satisfying the preliminary fault determination condition; and means for using the first operational data and the second operational data as operational data.
According to some embodiments, the apparatus 500 may further comprise means for determining another fault handling event for another instance different from the first instance; and means for merging the failure handling event with another failure handling event in response to determining that the failure handling event and the another failure handling event satisfy the rollup condition prior to executing the failure handling event on the first instance.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to FIG. 6, a block diagram of an electronic device 600, which may be the server or the client of the present disclosure and is an example of a hardware device to which aspects of the present disclosure may be applied, will now be described. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 608 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network, such as the internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth(TM) device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 601 executes the respective methods and processes described above, such as the method 200 or a modification thereof. For example, in some embodiments, method 200, or variations thereof, etc., may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. One or more steps of the method 200 described above or a variant thereof, etc. may be performed when the computer program is loaded into the RAM 603 and executed by the computing unit 601. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method 200 or variations thereof, and so on.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalents. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (20)

1. A fault handling method, comprising:
obtaining operational data, the operational data including data from a name service of a first module;
determining a failure handling event in response to determining that a first instance of the first module failed based on the operational data; and
executing the failure handling event on the first instance.
2. The method of claim 1, wherein determining that the first instance failed based on the operational data comprises:
in response to the data from the name service of the first module indicating that the latest state of the first instance is an exception state, determining a duration of the exception state; and
determining that the first instance failed in response to the duration exceeding a first threshold.
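A purely illustrative sketch of the duration check in claim 2 follows; the record fields ("state", "abnormal_since") and the 300-second value standing in for the first threshold are assumptions.

# Purely illustrative sketch of the duration check in claim 2.
import time
from typing import Optional

FIRST_THRESHOLD_SECONDS = 300.0  # hypothetical value of the "first threshold"


def failed_by_duration(name_service_record: dict, now: Optional[float] = None) -> bool:
    # The latest state must be an exception state, and must have stayed abnormal long enough.
    if name_service_record.get("state") != "abnormal":
        return False
    now = time.time() if now is None else now
    duration = now - name_service_record["abnormal_since"]
    return duration > FIRST_THRESHOLD_SECONDS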
3. The method of claim 1, wherein determining that the first instance failed based on the operational data comprises:
determining that the first instance failed in response to determining that the first module further includes at least one second instance and that data from the name service of the first module indicates that a difference in a state value of the first instance and a state value of the at least one second instance is greater than a second threshold.
4. The method of any of claims 1-3, wherein the operational data further comprises error reporting data from a third instance of a second module, and wherein determining that the first instance failed based on the operational data comprises:
determining that the first instance failed in response to determining from network topology data that error reporting data from the third instance indicates an anomaly associated with the first instance.
5. The method of any of claims 1-3, wherein the operational data includes error reporting data from a plurality of instances, and wherein determining that the first instance failed based on the operational data comprises:
determining that the first instance failed in response to an exception associated with the first instance satisfying at least one of the following conditions: most upstream reporting and voting decisions.
6. The method of any of claims 1-5, wherein the operational data further comprises metric data for the first instance, and wherein determining that the first instance failed based on the operational data comprises:
determining that the first instance failed in response to the metric data for the first instance being above a third threshold.
7. The method of any of claims 1-6, wherein obtaining operational data comprises:
collecting first operational data from at least one module including the first module;
obtaining previously collected second operational data in response to the value of the first operational data satisfying a preliminary fault determination condition, wherein the value of the second operational data does not satisfy the preliminary fault determination condition; and
using the first operational data and the second operational data as the operational data.
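A purely illustrative sketch of claim 7 follows: recent samples are kept in a bounded history, and once a sample satisfies the preliminary fault determination condition, the previously collected non-triggering samples are attached as context. The buffer size and the example condition are assumptions.

# Purely illustrative sketch of the two-stage data collection in claim 7.
from collections import deque
from typing import Callable, Deque, List, Optional


class OperationalDataCollector:
    def __init__(self, preliminary_condition: Callable[[float], bool], history_size: int = 60):
        self._history: Deque[float] = deque(maxlen=history_size)
        self._condition = preliminary_condition

    def collect(self, sample: float) -> Optional[List[float]]:
        # Return the second (pre-fault) data plus the triggering first data, or None.
        result = None
        if self._condition(sample):
            second_data = [s for s in self._history if not self._condition(s)]
            result = second_data + [sample]
        self._history.append(sample)
        return result


# Hypothetical usage: flag samples above 0.9 and keep the last 60 samples as context.
collector = OperationalDataCollector(lambda value: value > 0.9)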
8. The method of any of claims 1-7, wherein prior to executing the failure handling event on the first instance, the method further comprises:
determining another fault handling event for another instance different from the first instance; and
merging the fault handling event with the other fault handling event in response to determining that the fault handling event and the other fault handling event satisfy an aggregation condition.
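A purely illustrative sketch of one possible aggregation condition for claim 8 follows: events with the same action on the same module, raised within a short window, are merged into one batch event. The event fields and the 60-second window are assumptions.

# Purely illustrative sketch of event merging under an assumed aggregation condition (claim 8).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultEvent:
    module: str
    action: str              # e.g. "repair" or "migrate"
    instance_ids: List[str]
    created_at: float        # UNIX timestamp


def try_merge(event: FaultEvent, other: FaultEvent,
              window_seconds: float = 60.0) -> Optional[FaultEvent]:
    same_kind = event.module == other.module and event.action == other.action
    close_in_time = abs(event.created_at - other.created_at) <= window_seconds
    if not (same_kind and close_in_time):
        return None  # aggregation condition not satisfied; keep the events separate
    return FaultEvent(
        module=event.module,
        action=event.action,
        instance_ids=sorted(set(event.instance_ids + other.instance_ids)),
        created_at=min(event.created_at, other.created_at),
    )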
9. The method of any of claims 1-8, wherein the fault handling event comprises at least one of: repairing the instance, migrating the instance directly, and stopping service and then migrating the instance.
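A purely illustrative sketch of the three handling actions of claim 9 follows; repair_instance, migrate_instance, and stop_service are hypothetical stubs standing in for an operations platform and are not part of the disclosure.

# Purely illustrative dispatch of the handling actions named in claim 9.
from enum import Enum


class HandlingAction(Enum):
    REPAIR = "repair"
    MIGRATE = "migrate"                    # migrate the instance directly
    STOP_AND_MIGRATE = "stop_and_migrate"  # stop service first, then migrate


def repair_instance(instance_id: str) -> None: ...
def migrate_instance(instance_id: str) -> None: ...
def stop_service(instance_id: str) -> None: ...


def execute(action: HandlingAction, instance_id: str) -> None:
    if action is HandlingAction.REPAIR:
        repair_instance(instance_id)
    elif action is HandlingAction.MIGRATE:
        migrate_instance(instance_id)
    elif action is HandlingAction.STOP_AND_MIGRATE:
        stop_service(instance_id)          # drain traffic before moving the instance
        migrate_instance(instance_id)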
10. A fault handling device comprising:
an operational data acquisition unit configured to acquire operational data including data from a name service of a first module;
a fault event determination unit to determine a fault handling event in response to determining that a first instance of the first module is faulty based on the operational data; and
a fault handling unit to execute the fault handling event on the first instance.
11. The apparatus of claim 10, wherein the failure event determination unit comprises:
means for determining a duration of an exception state in response to the data from the name service of the first module indicating that the latest state of the first instance is the exception state; and
means for determining that the first instance failed in response to the duration exceeding a first threshold.
12. The apparatus of claim 10, wherein the failure event determination unit comprises:
means for determining that the first instance failed in response to determining that the first module further includes at least one second instance and that data of the name service from the first module indicates that a difference in a state value of the first instance and a state value of the at least one second instance is greater than a second threshold.
13. The apparatus according to any of claims 10-12, wherein the operational data further comprises error reporting data from a third instance of a second module, and wherein the fault event determination unit comprises:
means for determining that the first instance failed in response to determining from network topology data that error reporting data from the third instance indicates an anomaly associated with the first instance.
14. The apparatus of any of claims 10-12, wherein the operational data comprises error reporting data from a plurality of instances, and wherein the fault event determination unit comprises:
means for determining that the first instance failed in response to an exception associated with the first instance satisfying at least one of the following conditions: the exception being reported by a majority of upstream instances, and the exception being confirmed by a voting decision.
15. The apparatus according to any of claims 10-14, wherein the operational data further comprises metric data of the first instance, and wherein the fault event determination unit comprises:
means for determining that the first instance failed in response to the metric data for the first instance being above a third threshold.
16. The apparatus according to any one of claims 10-15, wherein the operational data acquisition unit comprises:
means for collecting first operational data from at least one module including the first module;
means for obtaining previously collected second operational data in response to the value of the first operational data satisfying a preliminary fault determination condition, the value of the second operational data not satisfying the preliminary fault determination condition; and
means for using the first operational data and the second operational data as the operational data.
17. The apparatus of any one of claims 10-16, wherein the apparatus further comprises:
means for determining, for another instance different from the first instance, another fault handling event; and
means for merging the failure handling event with the other failure handling event in response to determining that the failure handling event and the other failure handling event satisfy an aggregation condition prior to executing the failure handling event on the first instance.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-9.
CN202110937028.8A 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium Active CN113656207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937028.8A CN113656207B (en) 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110937028.8A CN113656207B (en) 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113656207A 2021-11-16
CN113656207B (en) 2023-11-03

Family

ID=78491655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937028.8A Active CN113656207B (en) 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113656207B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101167057A (en) * 2005-04-29 2008-04-23 法特斯帕尼尔技术公司 Computer implemented systems and methods for pre-emptive service and improved use of service resources
CN106021005A (en) * 2016-05-10 2016-10-12 北京金山安全软件有限公司 Method and device for providing application service and electronic equipment
CN110457151A (en) * 2019-07-10 2019-11-15 五八有限公司 Hot restorative procedure, device and readable storage medium storing program for executing
CN111026433A (en) * 2019-12-23 2020-04-17 中国人民解放军国防科技大学 Method, system and medium for automatically repairing software code quality problem based on code change history
US20210191802A1 (en) * 2019-12-23 2021-06-24 Atlassian Pty Ltd. Incident detection and management
CN112583931A (en) * 2020-12-25 2021-03-30 北京百度网讯科技有限公司 Message processing method, message middleware, electronic device and storage medium
CN113010313A (en) * 2021-03-16 2021-06-22 深圳市腾讯网络信息技术有限公司 Load balancing method and device, electronic equipment and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴今培 (WU Jinpei): "Networked Monitoring and Fault Diagnosis", 振动、测试与诊断 (Journal of Vibration, Measurement & Diagnosis), no. 02 *

Also Published As

Publication number Publication date
CN113656207B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
US10853161B2 (en) Automatic anomaly detection and resolution system
EP3889777A1 (en) System and method for automating fault detection in multi-tenant environments
US10558545B2 (en) Multiple modeling paradigm for predictive analytics
CN108845910A (en) Monitoring method, device and the storage medium of extensive micro services system
US20200364607A1 (en) Systems and methods for unsupervised anomaly detection using non-parametric tolerance intervals over a sliding window of t-digests
US20210097431A1 (en) Debugging and profiling of machine learning model training
US11449798B2 (en) Automated problem detection for machine learning models
CN112631887A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer-readable storage medium
US10372572B1 (en) Prediction model testing framework
US20210366268A1 (en) Automatic tuning of incident noise
US20220222266A1 (en) Monitoring and alerting platform for extract, transform, and load jobs
CN115640107A (en) Operation maintenance method, device, equipment and medium
US11468365B2 (en) GPU code injection to summarize machine learning training data
CN110018932B (en) Method and device for monitoring container magnetic disk
US10305762B2 (en) Techniques for determining queue backlogs, active counts, and external system interactions in asynchronous systems
US11599404B2 (en) Correlation-based multi-source problem diagnosis
WO2021067385A1 (en) Debugging and profiling of machine learning model training
CN113656207B (en) Fault processing method, device, electronic equipment and medium
CN114116128B (en) Container instance fault diagnosis method, device, equipment and storage medium
CN114706893A (en) Fault detection method, device, equipment and storage medium
CN114566148B (en) Cluster voice recognition service, detection method and device thereof and electronic equipment
CN113010383B (en) Error log filtering method, device, electronic equipment and readable medium
CN117331737A (en) Fault processing method and device, electronic equipment and storage medium
CN117608897A (en) Fault index acquisition method, device, equipment and storage medium
CN114297964A (en) Reset circuit verification method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant