CN113656207B - Fault processing method, device, electronic equipment and medium - Google Patents

Fault processing method, device, electronic equipment and medium

Info

Publication number
CN113656207B
CN113656207B (application CN202110937028.8A)
Authority
CN
China
Prior art keywords
instance
data
determining
module
fault
Prior art date
Legal status
Active
Application number
CN202110937028.8A
Other languages
Chinese (zh)
Other versions
CN113656207A (en)
Inventor
楚振江
李建均
宋晓东
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110937028.8A priority Critical patent/CN113656207B/en
Publication of CN113656207A publication Critical patent/CN113656207A/en
Application granted granted Critical
Publication of CN113656207B publication Critical patent/CN113656207B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0793 Remedial or corrective actions

Abstract

The disclosure provides a fault processing method, a device, an electronic device, and a medium, relating to the field of computer technology, and in particular to service operation and maintenance and cloud platforms. The fault handling method comprises the following steps: acquiring operation data, wherein the operation data comprises data of a name service from a first module; determining a fault handling event in response to determining, based on the operation data, that a first instance of the first module has failed; and performing the fault handling event on the first instance.

Description

Fault processing method, device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to service operation and maintenance and cloud platforms, and more particularly to a fault processing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
In a computer system, when faults and anomalies occur in an online service instance and are not recognized and handled in time, the service stability of the overall system is affected. Reducing labor cost while improving automated fault handling capability has therefore become an operation and maintenance problem to be solved.
It is desirable to have a method of identifying and handling failed instances in a computer system as such failures occur.
Disclosure of Invention
The present disclosure provides a fault handling method, apparatus, electronic device, computer readable storage medium and computer program product.
According to an aspect of the present disclosure, there is provided a fault handling method, including: acquiring operation data, wherein the operation data comprises data of name service from a first module; determining a fault handling event in response to determining that the first instance of the first module failed based on the operational data; and executing the fault handling event on the first instance.
According to another aspect of the present disclosure, there is provided a fault handling apparatus including: an operation data acquisition unit configured to acquire operation data including data of a name service from a first module; a failure event determination unit configured to determine a failure processing event in response to determining that the first instance of the first module fails based on the operation data; and a fault handling unit configured to execute the fault handling event on the first instance.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a fault handling method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a fault handling method according to an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a fault handling method according to an embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, faults in system services may be accurately and efficiently determined and handled.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a fault handling method according to an embodiment of the present disclosure;
FIGS. 3A-3C illustrate various scenario diagrams of fault identification and handling according to embodiments of the present disclosure;
FIG. 4 shows a schematic diagram of data flow during fault handling according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a fault handling apparatus according to an embodiment of the present disclosure; and
Fig. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the fault handling methods according to embodiments of the present disclosure.
In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 to control the operation of computer services, view fault identification and processing results, and so on. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as Microsoft Windows, Apple iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., Google Chrome OS); or include various mobile operating systems such as Microsoft Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays and other devices. The gaming system may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. The data store 130 may reside in a variety of locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In some embodiments, the data store used by server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data to and from the database in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
A fault handling method 200 according to an embodiment of the present disclosure is described below with reference to fig. 2.
At step 210, operational data is obtained, the operational data including data of a name service from a first module.
At step 220, a fault handling event is determined in response to determining that the first instance of the first module failed based on the operational data.
At step 230, a fault handling event is performed on the first instance.
By employing a name service as a fault monitoring indicator in accordance with the method 200 of the present disclosure, faults in an instance can be effectively identified.
Name services are service functions used to describe instances of a module; they can provide service name-based resolution capabilities for addressing between large-scale computer programs and can return instance state data under the module. By using the name service as a supervision index of an instance, the real calling conditions inside the modules of the computer system can be better reflected. For example, simple index data collection may suffer from the problem that the collected data does not belong to a live instance or that the instance has already been shut down; collecting through the name service instead reflects the actual running condition of each instance in the current system more truthfully. In addition, there are cases where simple indexes such as CPU usage show no anomaly, yet the name service state is abnormal. Thus, by utilizing data of a name service, more flexible, comprehensive, and efficient fault identification and fault handling can be obtained according to embodiments of the present disclosure.
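As a minimal, non-authoritative sketch of how such name service data might be collected, the following Python code assumes a hypothetical `NameServiceClient` with a `resolve` method and an illustrative per-instance state record; both names are placeholders rather than any API defined by the disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InstanceState:
    """One entry per instance registered under a module's service name."""
    instance_id: str
    state_code: int      # e.g. 0 = normal, 1 = slightly abnormal, 2 = moderately abnormal
    timestamp: float

class NameServiceClient:
    """Placeholder for whatever name-service resolution API a deployment exposes."""
    def resolve(self, service_name: str) -> List[InstanceState]:
        raise NotImplementedError   # would query the real name service

def collect_name_service_data(client: NameServiceClient, service_name: str) -> List[InstanceState]:
    # Every instance currently registered under the service name is returned,
    # so closed or missing instances do not silently drop out of monitoring.
    return client.resolve(service_name)
```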
A complex system, including a computer cloud service module system, may be composed of hundreds or thousands of modules. In particular, with the gradual development of computer Internet cloud technology, more and more cloud services adopt a micro-service architecture, which results in functional decoupling and splitting between modules, a huge number of modules, and a complex system architecture. Each module consists of computer program instances with the same functionality, and thus in complex systems, computer program instances may reach scales on the order of hundreds of thousands to millions. Therefore, failures and abnormal situations of online service instances may occur frequently, and if such failures are not recognized and handled in time, the service stability of the overall system may be affected. Reducing labor cost while improving automated fault handling capability has become an operation and maintenance problem to be solved.
According to embodiments of the present disclosure, service faults of a computer program can be sensed and service state indexes collected; the scenario of the service fault is analyzed by processing the index data; decision scheme matching and confirmation are carried out for the identified service fault; and after the processing scheme of the service fault is determined, it is submitted to a service fault processing stage, completing the automatic fault processing of the computer program service. Such automated resolution of failures, without requiring manual confirmation, may be referred to as "self-healing". The method according to embodiments of the present disclosure can be applied to fault identification and fault handling of various service systems, including but not limited to search engine systems, information flow recommendation systems, cloud service systems, and the like. It is to be understood that the present disclosure is not so limited.
As shown in FIG. 3A, the plurality of instances of the computer system includes instances with normal function and instances with abnormal function. Where fault identification makes errors, the instances covered by fault identification may include abnormal instances (correctly identified, corresponding to area S1) and normal instances (misidentified, corresponding to area S2); among the instances not covered by fault identification, there may likewise be abnormal instances (unidentified, corresponding to area S3) and normal instances (corresponding to area S4). Scene coverage refers to the ratio of the number of instances covered by fault identification to the number of all instances in the system. The self-healing recall refers to the proportion of abnormal instances identified by fault identification relative to all abnormal instances in the system. The self-healing accuracy refers to the proportion of abnormal instances among all instances covered by fault identification. As shown in FIG. 3A, scene coverage = (S1+S2)/(S1+S2+S3+S4), self-healing recall = S1/(S1+S3), and self-healing accuracy = S1/(S1+S2). In service fault identification and handling, it is desirable to improve the scene coverage, the self-healing recall, and the self-healing accuracy, so that abnormal instances can be identified more accurately and processed automatically, enabling the overall computer service to function normally.
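The three ratios can be computed directly from the counts S1 through S4; the following short Python sketch illustrates this with invented example counts.

```python
def scene_coverage(s1, s2, s3, s4):
    # Instances covered by fault identification over all instances.
    return (s1 + s2) / (s1 + s2 + s3 + s4)

def self_healing_recall(s1, s3):
    # Correctly identified abnormal instances over all abnormal instances.
    return s1 / (s1 + s3)

def self_healing_accuracy(s1, s2):
    # Correctly identified abnormal instances over all covered instances.
    return s1 / (s1 + s2)

# Invented counts: 40 correctly identified abnormal instances (S1), 10 misidentified
# normal instances (S2), 5 missed abnormal instances (S3), 945 uncovered normal (S4).
print(scene_coverage(40, 10, 5, 945))    # 0.05
print(self_healing_recall(40, 5))        # ~0.889
print(self_healing_accuracy(40, 10))     # 0.8
```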
An example relationship of modules and instances in a computer system is described with reference to FIG. 3B. A first module 321, named program service A, is shown in FIG. 3B with two instances 3211 and 3212, and a second module 322, named program service B, is also shown with three instances 3221, 3222, and 3223. It is understood that the number of modules and the number of instances here are examples. The above data may be obtained by querying the name service function of each module. Thereafter, the data of each instance obtained through the name service may be analyzed to determine whether an anomaly exists in the instance, and if so, the instance is processed in accordance with the fault handling decision.
Variations of fault handling methods according to some embodiments of the present disclosure are described below.
According to some embodiments, determining that the first instance failed based on the operational data includes determining a duration of the current abnormal state in response to data from the name service of the first module indicating that the most recent state of the first instance is an abnormal state. Then, in response to the duration exceeding a first threshold, it is determined that the first instance failed.
In general, the data of a name service indicates a single state of an instance, such as the current state of the instance or the state at a time corresponding to a timestamp. For example, the data of the name service may include service states or state codes defined in the name service, e.g., 0, 1, 2, etc., to identify the different states, and it is understood that the present disclosure is not limited thereto. Therefore, by additionally monitoring the state duration, the accuracy of fault identification can be further ensured. For example, the historical state of an instance may be queried, and an instance is deemed to have failed if the current abnormal state persists beyond a certain duration threshold (e.g., 3 hours) or, alternatively, beyond a certain count threshold (e.g., with statistics collected once per hour, three consecutive abnormal samples). It is to be understood that the thresholds here are merely examples, and one skilled in the art may set larger or smaller thresholds, or different thresholds for different modules or different instances (e.g., depending on how critical or important the module is).
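A simple illustration of the duration-based check described above is sketched below in Python; the state code, the history format, and the 3-hour threshold are assumptions chosen for the example.

```python
import time

ABNORMAL = 1                      # assumed state code for an abnormal state
DURATION_THRESHOLD_S = 3 * 3600   # example first threshold: 3 hours

def instance_failed_by_duration(state_history, now=None):
    """state_history: list of (unix_timestamp, state_code) samples, oldest first.

    Returns True when the most recent state is abnormal and the current
    abnormal run has lasted longer than the duration threshold."""
    if not state_history or state_history[-1][1] != ABNORMAL:
        return False
    now = time.time() if now is None else now
    run_start = state_history[-1][0]
    for ts, state in reversed(state_history):
        if state != ABNORMAL:
            break                 # reached the last normal sample
        run_start = ts            # push the start of the abnormal run back
    return now - run_start > DURATION_THRESHOLD_S

# Four hourly samples, the last three abnormal: the run started at t=3600,
# so at now=15000 it has lasted 11400 s, exceeding the 3-hour threshold.
history = [(0, 0), (3600, 1), (7200, 1), (10800, 1)]
print(instance_failed_by_duration(history, now=15000))  # True
```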
According to some embodiments, determining that the first instance failed based on the operational data includes determining that the first instance failed in response to determining that the first module further includes at least one second instance, and that the data from the name service of the first module indicates that a difference in a state value of the first instance and a state value of the at least one second instance is greater than a second threshold.
Abnormal instances can be found and identified by calculating whether, among the instances under the same service name, there are instances that behave differently from the majority. Continuing with the example in FIG. 3B, assume that after reading data through the name service, three instances under the service name B are acquired, and that by analyzing the index data of each instance, the index data of instance 3222 is found to differ greatly from that of instances 3221 and 3223. In such a case, the instance 3222 in the second module may be regarded as an abnormal instance. Where a state code is used to characterize the instance state, the second threshold may be a certain state code interval, or the second threshold may be set according to a certain anomaly level gap. For example, if a value of 0 indicates a normal state, a value of 1 indicates a slightly abnormal state, a value of 2 indicates a moderately abnormal state, and so on, the second threshold may correspond to a state value difference of 2 or more. It is to be understood that the present disclosure is not so limited. According to such embodiments, fault identification and handling policies based on the surrounding view of an instance can be implemented based on comparisons among all instances within a name service.
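The peer-comparison check may, for example, be implemented by comparing each instance's state value against its peers under the same service name; the following sketch uses the median as a stand-in for the majority behavior, which is an assumption of this illustration rather than a requirement of the disclosure.

```python
def peer_outliers(states, second_threshold=1):
    """states: mapping of instance_id -> state value read through the name service.

    Flags instances whose state value differs from the median of all instances
    under the same service name by more than the threshold; with integer state
    codes and a threshold of 1, a gap of 2 or more triggers, matching the
    example above."""
    values = sorted(states.values())
    median = values[len(values) // 2]
    return [iid for iid, value in states.items()
            if abs(value - median) > second_threshold]

# Instance 3222 reports a much higher state code than its peers 3221 and 3223:
print(peer_outliers({"3221": 0, "3222": 3, "3223": 0}))  # ['3222']
```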
According to some embodiments, the operational data further comprises error-reporting data from a third instance of the second module, and wherein determining that the first instance failed based on the operational data comprises: in response to determining from the network topology data that the error-reporting data from the third instance indicates an anomaly associated with the first instance, determining that the first instance failed.
Using the fault data of other instances for diagnosis allows instance faults to be identified from the perspective of data flow and service effect, thereby judging service faults more comprehensively. Where the error-reporting information of another instance is found to point to the current instance, based on the network topology data and that error-reporting information, the failure of the current instance can be effectively identified.
For example, referring to FIG. 3C, an error report is received from the first instance 3311 of the first module 331 indicating that the data or instruction sent to the downstream third instance 3323 of the second module 332 is abnormal. Thus, instance 3323 may be considered to have failed. As another example, instance 3311 may also be considered to be faulty if an error report concerning instance 3311 is received from the third instance 3323 of the second module 332. It is to be understood that this is only an example and that the present disclosure is not limited thereto.
The network topology data may be acquired using a service mesh technique. A service mesh is deployed in a computer cloud program service to solve the problem of managing service module call relationships in complex systems that arises with the development of cloud-native technology and micro-services. For example, a service mesh agent may be deployed in each program service, and a service mesh control center may be deployed externally, with the mesh agent of each program service connected to the control center and routed by the control center, to implement call and access relationships between modules or instances. It will be appreciated that other deployment approaches and topology data acquisition approaches are possible. With continued reference to FIG. 3C, three modules 331, 332, and 333 in a computer system are shown from the perspective of the service topology, with a first module 331 having two instances 3311 and 3312, a second module 332 having three instances 3321, 3322, and 3323, and a third module 333 having two instances 3331 and 3332. The first module may call the second module, and the second module may call the third module. It is to be understood that the number of modules, the number of instances, the upstream-downstream relationship between modules, and the manner of connection between instances are examples here, and the disclosure is not limited thereto.
According to some embodiments, the operational data comprises error-reporting data from a plurality of instances, and wherein determining that the first instance failed based on the operational data comprises: determining that the first instance failed in response to an anomaly associated with the first instance satisfying at least one of the following conditions: majority upstream reporting and voting decisions.
In cases where the network topology is complex and the first instance has multiple associated instances, majority upstream reporting and voting decisions may be employed. Observing and discovering abnormal instances from upstream can be simpler and more intuitive. For example, if two of the three instances upstream report an anomaly associated with a particular instance (the "first instance"), then the first instance is deemed to have failed. With continued reference to the example of FIG. 3C, two instances of the first module 331 access three instances of the downstream second module 332; when an abnormal condition occurs in the instance 3323 of the second module and the computer program function service cannot be provided normally, the instance 3311 of the first module will count and feed back the abnormal condition. Similarly, instance 3312 of the first module will also report feedback on the abnormal condition when invoking instance 3323. Thus, anomaly feedback for a particular instance of a downstream module is obtained from instances of a plurality of upstream modules, and the instance 3323 of the second module is identified as an abnormal instance based on majority upstream reporting and voting decisions. As another example, assuming that only one instance (e.g., instance 3312) reports an error against instance 3323, while other instances (e.g., other instances calling instance 3323, not shown in the figures) report no error, then instance 3323 may be deemed not to have failed based on the voting decision or the majority reporting decision. It is to be understood that the above descriptions of the number of modules, the number of instances, and the "majority" determination condition are merely examples, and the present disclosure is not limited thereto.
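A possible form of the majority-upstream voting decision is sketched below; the report format and the strict-majority rule are illustrative assumptions.

```python
def majority_upstream_vote(error_reports, upstream_count, target_instance):
    """error_reports: (reporter_instance, blamed_instance) pairs derived from
    topology data and per-instance error-reporting data.

    The target is treated as failed when more than half of its known upstream
    callers report an anomaly against it; a single reporter among several
    upstream callers does not carry the vote."""
    reporters = {src for src, dst in error_reports if dst == target_instance}
    return len(reporters) * 2 > upstream_count

# Both upstream instances of FIG. 3C (3311 and 3312) blame instance 3323:
reports = [("3311", "3323"), ("3312", "3323")]
print(majority_upstream_vote(reports, upstream_count=2, target_instance="3323"))  # True
```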
According to some embodiments, the operational data further comprises index data of the first instance, and wherein determining that the first instance failed based on the operational data comprises: determining that the first instance failed in response to the index data of the first instance being above a third threshold. The index data may include, but is not limited to, time consumption (e.g., a timeout indicates an anomaly), resource occupancy (e.g., CPU or memory usage exceeding a predetermined occupancy rate indicates an anomaly), a return value (e.g., an error return value, or failure to obtain the expected return value, indicates an anomaly), and the like, and it is understood that the present disclosure is not limited thereto. Determining that an instance failed based on operational data including index data may also make use of historical index data of the instance, module instance configuration, semantic indexes, and so on. Identifying and discovering abnormal conditions of a service instance through index data allows the instance to be identified and processed quickly. By combining the index data with the data of the name service (and, optionally, the error-reporting data of other instances, etc.), faults can be identified and handled more comprehensively and accurately.
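One possible shape of the third-threshold check on index data is sketched below; the metric names and threshold values are invented for illustration.

```python
# Illustrative third-threshold values; a real deployment would tune these.
THRESHOLDS = {"latency_ms": 500, "cpu_pct": 50, "mem_pct": 80}

def index_data_exceeds(index_data, thresholds=THRESHOLDS):
    """index_data: mapping of metric name -> observed value for one instance.

    Returns the metrics whose values are above their configured threshold;
    a non-empty result marks the instance as a failure candidate."""
    return [name for name, value in index_data.items()
            if name in thresholds and value > thresholds[name]]

print(index_data_exceeds({"latency_ms": 900, "cpu_pct": 30}))  # ['latency_ms']
```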
The above-mentioned operation data (such as operation data including one or more of index data, name service data, and topology and error reporting data) may be collected and processed in real time, or may be collected and temporarily stored in a memory, and then summarized, counted and analyzed to determine a fault instance, a fault handling event, etc. if a trigger condition is satisfied (e.g., periodically triggered). Periodic triggering may include periodically (e.g., every ten minutes, every hour, every day, etc.) pulling data associated with modules in the system (e.g., the first and second modules described above) and analyzing the data to determine if anomalies associated with these modules and their instances exist.
For example, the data may be read periodically to determine whether there are anomalies reported by the modules upstream and downstream of a given module. Such a policy may be applicable, for example, when the reported data exceeds thresholds only mildly and does not trigger a real-time exception mechanism.
As a further example, the trigger condition may include receiving relatively severe anomaly data (e.g., greater than 50%) and thereby triggering a query and summary of the data of the module or related modules to determine whether a fault identification policy is satisfied.
According to some embodiments, acquiring operational data includes: acquiring first operational data from at least one module including the first module; in response to the value of the first operational data meeting a preliminary fault judgment condition, acquiring previously collected second operational data whose value does not meet the preliminary fault judgment condition; and using the first operational data and the second operational data together as the operational data. In other words, data whose preliminary judgment does not significantly exceed thresholds (e.g., the CPU exceeds its threshold by only 10%, or the name service shows no obvious anomaly or only a low-level anomaly) may be stored. Afterwards, when operational data that significantly exceeds a threshold or is clearly abnormal is collected (such as a very obvious abnormal name service state value, a CPU occupancy exceeding 40% or 50%, errors reported by multiple associated instances, etc.), such data can be pushed directly to the fault identification unit for determination and event identification, and the previously stored non-exceeding data can be read for summarization and analysis.
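The two-stage collection described above might be organized as in the following sketch, where the preliminary fault judgment condition is supplied as a callable and the CPU threshold is an illustrative assumption.

```python
class OperationDataBuffer:
    """Sketch of the two-stage collection described above: samples whose values
    do not meet the preliminary fault judgment condition are buffered as
    "second operation data"; once a sample does meet the condition ("first
    operation data"), the buffered history is attached and forwarded."""

    def __init__(self, preliminary_condition, push_to_decision_layer):
        self._buffer = []                      # second operation data, not yet over threshold
        self._condition = preliminary_condition
        self._push = push_to_decision_layer    # e.g. hands the batch to fault identification

    def ingest(self, sample):
        if self._condition(sample):
            batch = self._buffer + [sample]    # first + previously stored second operation data
            self._buffer = []
            self._push(batch)
        else:
            self._buffer.append(sample)

# Illustrative preliminary condition: CPU occupancy above 40%.
buf = OperationDataBuffer(lambda s: s["cpu_pct"] > 40, print)
buf.ingest({"cpu_pct": 12})   # stored, not forwarded
buf.ingest({"cpu_pct": 55})   # prints both samples as one batch
```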
According to some embodiments, the fault handling event comprises at least one of: repairing an instance, directly migrating an instance, and stopping service and then migrating an instance. In this way, automated handling of faults can be realized, manual intervention and labor cost are reduced, and fault processing efficiency is improved. Stopping the service before migrating may be more time-efficient than migrating directly. It is to be understood that the above is merely an example and that the present disclosure is not limited thereto. Optionally, a safety-lock mechanism may be implemented for the instances of a processed module, imposing upper-limit control on the number and proportion of abnormal instances processed automatically, and enabling a higher-level safety policy or outputting a warning indicating that manual intervention is required when the upper limit is exceeded.
According to some embodiments, before performing the fault handling event on the first instance, the method further comprises: determining another fault handling event for another instance different from the first instance; and in response to determining that the fault handling event and the other fault handling event meet a summary condition, merging the fault handling event with the other fault handling event. The summary condition may be the same fault type (and thus the same or similar fault handling events). The summary condition may also be an instance location or module, etc. For example, the computed fault instances and fault events may be sent to a summarizing component, which may be referred to as a self-healing event bus, and then summarized, categorized, and distributed. For example, the entire service system may be deployed on a computing cloud different from the computing cloud to be managed, and different fault processors may be responsible for different modules or instances, etc. As an example of assigning fault processors by fault handling event type, the processors may include the following classifications: a direct migration instance processor; a stop-service-then-migrate instance processor; a machine fault migration processor; and a multi-instance exception processor. It is to be understood that the present disclosure is not so limited.
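The summarizing and dispatching of fault handling events could take roughly the following form; the summary condition (same module and fault type), the processor names, and the safety-lock cap are assumptions of this sketch.

```python
from collections import defaultdict

# Hypothetical processor registry keyed by fault handling event type
# (cf. the processor classifications described above).
PROCESSORS = {
    "migrate": "DirectMigrationInstanceProcessor",
    "stop_then_migrate": "StopServiceThenMigrateInstanceProcessor",
    "repair": "RepairProcessor",
}

def merge_and_dispatch(events, max_auto_handled=10):
    """events: list of dicts such as {"instance": ..., "module": ..., "type": ...}.

    Events meeting the assumed summary condition (same module and same fault
    type) are merged into one batch; a simple safety-lock cap limits how many
    instances in a batch are handled automatically before escalating."""
    merged = defaultdict(list)
    for ev in events:
        merged[(ev["module"], ev["type"])].append(ev["instance"])

    dispatched = []
    for (module, ev_type), instances in merged.items():
        if len(instances) > max_auto_handled:
            # Upper-limit control: too many abnormal instances, require manual review.
            dispatched.append((module, ev_type, "ESCALATE_TO_MANUAL", instances))
        else:
            processor = PROCESSORS.get(ev_type, "UnknownProcessor")
            dispatched.append((module, ev_type, processor, instances))
    return dispatched

events = [
    {"instance": "3321", "module": "service_B", "type": "migrate"},
    {"instance": "3323", "module": "service_B", "type": "migrate"},
]
print(merge_and_dispatch(events))
# [('service_B', 'migrate', 'DirectMigrationInstanceProcessor', ['3321', '3323'])]
```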
The fault handling scheme according to one or more embodiments of the present disclosure may be implemented in combination with the program itself, for example, by embedding an operation data monitoring module or the like in the program module, and thus need not depend on a cloud service provider or a third party, thereby improving the efficiency of fault identification at least by reducing the amount of data transmitted and the scale of the fault identification system.
The data flow during fault identification and handling according to an embodiment of the present disclosure is described below with reference to fig. 4. As shown in fig. 4, the system may include a data acquisition layer 410, an event decision layer 420, and an execution processing layer 430. It is to be understood that the partitioning herein is merely a functional example and does not require such partitioning physically. As one example, two or all of the layers may be physically located together, or each of the layers may be physically separated into further functional modules and communicatively coupled to implement the functionality described herein. Those skilled in the art will appreciate that the present disclosure is not so limited.
The data collection layer 410 may be used to obtain various types of operational data from computer program services, including, but not limited to, metrics data 411, topology data 412, name service data 413, and the like. As previously described, the data collection layer may simply aggregate and process the operational data (e.g., through the index collection aggregation 414) and then store it in the database 415; alternatively, the data collection layer may push data directly to the event decision layer 420 according to simple policies or priorities, such as sending operational data of high likelihood of failure to the event decision layer 420, while storing indicators of low likelihood of failure in the database 415 for later use.
The event decision layer 420 may perform further analysis processing on the acquired operational data and compare the operational data to a predetermined fault handling policy, for example, using fault handling policy identification 421, to determine a fault handling event. For example, the index data may be compared with a predetermined threshold. For topology data, a majority of upstream reporting and voting decisions, etc. may be taken. The current state value may be checked against the name service data periodically, compared to the state values of other instances under the same service name, and/or a determination of whether an instance is abnormal may be made based on the duration of the state abnormality. Other processing manners of data and examples of anomaly determination strategies have been described above, and are not described in detail herein. The event decision layer may classify and identify the fault scenario according to the operation data collected in real time, the operation data pre-stored in the database, and optional external signals 422, and determine a corresponding decision event according to the identified fault scenario. The external signal may refer to a configuration signal from the outside, manual setting, and the like. The fault handling decision event may then be sent to the execution handling layer 430.
The execution processing layer 430 may aggregate (e.g., using the event aggregation center 431) the decision events of the multiple instances and then distribute the aggregated decision events to different fault processors 432, and send instructions to cause the corresponding fault processors to perform operations such as repairing, replacing, and stopping the service of the computer program by linking with the cloud service where the computer program is located. Therefore, the fault identification and processing of the computer program can be realized, and the self-healing of the service abnormality is realized. According to one or more embodiments, by using operation data different from the traditional dimension as a fault index, fault identification and service self-healing can be realized in the view of a computer program itself, the view of the upstream and downstream of a service network topology, the view of the surrounding of a name service and the like, the normal stability and reliability of the function of the computer service are improved, and the robustness of a complex computer system is further improved.
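Taken together, the three layers of FIG. 4 can be viewed as a simple pipeline; the following sketch wires hypothetical collect, decide, and execute callables together and is not tied to any specific implementation of those layers.

```python
def self_healing_pipeline(collect, decide, execute, trigger):
    """Wires the three layers of FIG. 4 together: the data collection layer
    gathers operation data, the event decision layer maps it to fault handling
    events, and the execution processing layer aggregates and dispatches the
    events to fault processors. All four callables are placeholders."""
    while trigger():                       # e.g. periodic trigger or significant anomaly
        operation_data = collect()         # data collection layer (410)
        events = decide(operation_data)    # event decision layer (420)
        if events:
            execute(events)                # execution processing layer (430)

# One-shot example run with toy callables.
runs = iter([True, False])
self_healing_pipeline(
    collect=lambda: [{"instance": "3323", "status": 2}],
    decide=lambda data: [{"instance": d["instance"], "type": "repair"}
                         for d in data if d["status"] != 0],
    execute=print,
    trigger=lambda: next(runs),
)
```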
A fault handling apparatus 500 according to an example embodiment of the present disclosure is described below in connection with fig. 5. The apparatus 500 includes: an operation data acquisition unit 510, a fault event determination unit 520, and a fault processing unit 530. The operation data acquisition unit 510 may be configured to acquire operation data including data of a name service from the first module. The fault event determination unit 520 may be configured to determine a fault handling event in response to determining that the first instance of the first module is faulty based on the operational data. The fault handling unit 530 may be configured to perform a fault handling event on the first instance.
According to some embodiments, the fault event determination unit 520 may include means for determining a duration of the current abnormal state in response to data from the name service of the first module indicating that the latest state of the first instance is the abnormal state; and means for determining that the first instance failed in response to the duration exceeding a first threshold.
According to some embodiments, the failure event determination unit 520 may include means for determining that the first instance failed in response to determining that the first module further includes at least one second instance, and that the data from the name service of the first module indicates that a difference in the state value of the first instance and the state value of the at least one second instance is greater than a second threshold.
According to some embodiments, the operational data may further comprise error reporting data from a third instance of the second module, and wherein the failure event determining unit may comprise means for determining that the first instance failed in response to determining from the network topology data that the error reporting data from the third instance indicates an anomaly associated with the first instance.
According to some embodiments, the operational data may include error-reporting data from a plurality of instances, and wherein the failure event determination unit may include a unit for determining that the first instance failed in response to an anomaly associated with the first instance meeting at least one of the following conditions: majority upstream reporting and voting decisions.
According to some embodiments, the operational data may further comprise index data of the first instance, and wherein the failure event determination unit 520 may comprise means for determining that the first instance failed in response to the index data of the first instance being above a third threshold.
According to some embodiments, the operational data acquisition unit 510 may include a unit for acquiring first operational data from at least one module including a first module; a unit for acquiring previously acquired second operation data in response to the value of the first operation data satisfying the preliminary fault determination condition, the value of the second operation data not satisfying the preliminary fault determination condition; and a unit for using the first operation data and the second operation data as operation data.
According to some embodiments, the apparatus 500 may further include means for determining another fault handling event for another instance different from the first instance; and means for merging the failure handling event with another failure handling event in response to determining that the failure handling event meets a summary condition with the other failure handling event before executing the failure handling event on the first instance.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 6, a block diagram of an electronic device 600 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the device 600; the input unit 606 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the device 600 to exchange information/data with other devices through a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the method 200 or variations thereof, and the like. For example, in some embodiments, the method 200, or variations thereof, etc., may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the method 200 described above, or variants thereof, etc., may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method 200 or variants thereof, etc., in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims following the grant and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is important that as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.

Claims (18)

1. A fault handling method, comprising:
acquiring operation data, the operation data comprising data of a name service from a first module, wherein the name service is a service function used for describing an instance of a module, is capable of providing service-name-based resolution for addressing between large-scale computer programs, and is capable of returning state data of instances under the module, and wherein the name service is used as a fault monitoring indicator;
in response to determining, based on the operation data, that a first instance of the first module has failed, determining a fault handling event, wherein the fault handling event comprises at least one of: repairing the instance, directly migrating the instance, and stopping service and then migrating the instance; and
performing the fault handling event on the first instance.
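
For orientation only, the following Python sketch mirrors the flow recited in claim 1: acquire operation data from the name service, decide whether an instance has failed, choose one of the three handling events, and execute it. It is not part of the claims, and every name in it (the name-service client, its list_instance_states call, and the helper callbacks) is a hypothetical placeholder.

# Hypothetical sketch (not from the patent): acquire data, detect failure, pick and run an event.
from dataclasses import dataclass
from enum import Enum, auto

class HandlingEvent(Enum):
    REPAIR = auto()            # repair the instance in place
    MIGRATE = auto()           # directly migrate the instance
    STOP_AND_MIGRATE = auto()  # stop service first, then migrate the instance

@dataclass
class InstanceState:
    instance_id: str
    state: str          # e.g. "ok" or "abnormal", as reported by the name service
    state_value: float  # numeric health value reported by the name service

def handle_faults(name_service_client, module_id, is_faulty, choose_event, execute_event):
    # Acquire operation data: per-instance states returned by the name service (assumed API).
    instances = name_service_client.list_instance_states(module_id)
    for inst in instances:
        if is_faulty(inst, instances):              # e.g. one of the checks sketched after claims 3 and 6
            event = choose_event(inst)              # a HandlingEvent value
            execute_event(inst.instance_id, event)  # hand off to the repair / scheduling system
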
2. The method of claim 1, wherein determining that the first instance has failed based on the operation data comprises:
determining, in response to data from the name service of the first module indicating that a most recent state of the first instance is an abnormal state, a duration of the abnormal state; and
determining that the first instance has failed in response to the duration exceeding a first threshold.
3. The method of claim 1, wherein determining that the first instance has failed based on the operation data comprises:
determining that the first instance has failed in response to determining that the first module further includes at least one second instance and that the data from the name service of the first module indicates that a difference between a state value of the first instance and a state value of the at least one second instance is greater than a second threshold.
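
As a rough illustration of the checks in claims 2 and 3 (not the patent's prescribed implementation), the sketch below assumes each instance exposes an identifier, a latest state string, and a numeric state value, as in the InstanceState sketch above; the threshold values, the use of wall-clock time, and the comparison against the peer average are all assumptions.

import time

ABNORMAL_DURATION_THRESHOLD_S = 300  # "first threshold" (assumed value)
PEER_DEVIATION_THRESHOLD = 0.5       # "second threshold" (assumed value)
_abnormal_since = {}                 # instance_id -> time the abnormal state was first seen

def failed_by_duration(inst, now=None):
    # Claim 2: the latest state is abnormal and has stayed abnormal longer than the first threshold.
    now = time.time() if now is None else now
    if inst.state != "abnormal":
        _abnormal_since.pop(inst.instance_id, None)
        return False
    started = _abnormal_since.setdefault(inst.instance_id, now)
    return now - started > ABNORMAL_DURATION_THRESHOLD_S

def failed_by_peer_deviation(inst, peers):
    # Claim 3: the instance's state value differs from that of its peer instances by more than
    # the second threshold (the peers are summarised by their average, which is an assumption).
    others = [p.state_value for p in peers if p.instance_id != inst.instance_id]
    if not others:
        return False
    return abs(inst.state_value - sum(others) / len(others)) > PEER_DEVIATION_THRESHOLD
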
4. The method of any of claims 1-3, wherein the operation data further comprises error-reporting data from a third instance of a second module, and wherein determining that the first instance has failed based on the operation data comprises:
determining that the first instance has failed in response to determining, based on network topology data, that the error-reporting data from the third instance indicates an anomaly associated with the first instance.
5. The method of any of claims 1-3, wherein the operation data comprises error-reporting data from a plurality of instances, and wherein determining that the first instance has failed based on the operation data comprises:
determining that the first instance has failed in response to an anomaly associated with the first instance satisfying at least one of the following conditions: being reported by a majority of upstream instances, and being confirmed by a voting decision.
6. The method of any of claims 1-3, wherein the operation data further comprises index data of the first instance, and wherein determining that the first instance has failed based on the operation data comprises:
determining that the first instance has failed in response to the index data of the first instance being above a third threshold.
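
The conditions in claims 4 to 6 might be written along the following lines, under assumed data shapes: error reports as dictionaries naming the reporting instance and the instance they point at, network topology as a map from an upstream instance to its downstream dependency, and a simple majority quorum for the voting decision. None of these structures is prescribed by the claims.

def failed_by_upstream_report(first_id, error_reports, downstream_of):
    # Claim 4: error-reporting data from an instance of another module is attributed to the
    # first instance via network topology data (here a reporter -> downstream-instance map).
    return any(downstream_of.get(report["source"]) == first_id for report in error_reports)

def failed_by_vote(first_id, error_reports, upstream_ids, quorum=0.5):
    # Claim 5: a majority of the upstream instances report an anomaly about the first instance.
    reporters = {r["source"] for r in error_reports if r.get("about") == first_id}
    if not upstream_ids:
        return False
    return len(reporters & set(upstream_ids)) / len(upstream_ids) > quorum

def failed_by_metric(metric_value, third_threshold):
    # Claim 6: an index item of the instance (e.g. error rate or latency) exceeds the third threshold.
    return metric_value > third_threshold
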
7. The method of any of claims 1-3, wherein acquiring operation data comprises:
collecting first operation data from at least one module including the first module;
acquiring previously acquired second operation data in response to a value of the first operation data meeting a preliminary fault judgment condition, wherein a value of the second operation data does not meet the preliminary fault judgment condition; and
using the first operation data and the second operation data as the operation data.
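
One reading of claim 7 is a two-stage collection scheme: keep recent samples cheaply, and only when a sample trips a preliminary fault condition hand both that sample and the earlier, non-tripping history to the failure-determination step, so the decision sees data from before the symptom appeared. The sketch below assumes dictionary samples, a fixed-size ring buffer, and a hypothetical error-count condition.

from collections import deque

_history = deque(maxlen=100)  # recently collected samples that did not trip the condition

def preliminary_condition(sample):
    # Assumed preliminary fault judgment condition, e.g. a coarse error-count limit.
    return sample.get("error_count", 0) > 10

def collect(sample):
    # Claim 7: when the current (first) data meets the preliminary condition, fetch the
    # previously collected (second) data and use both together as the operation data.
    if preliminary_condition(sample):
        return list(_history) + [sample]  # hand this to the failure-determination step
    _history.append(sample)
    return None
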
8. The method of any of claims 1-3, wherein, before performing the fault handling event on the first instance, the method further comprises:
determining another fault handling event for another instance different from the first instance; and
merging the fault handling event with the other fault handling event in response to determining that the fault handling event and the other fault handling event meet a summary condition.
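
Claim 8's merging step might be approximated as grouping pending handling events under a summary condition before any of them is executed, e.g. so that several migrations raised for instances of the same module can be carried out as one batch. The grouping key used below (same module and same action) is only an illustrative assumption.

def merge_events(pending_events, summary_condition):
    # Group handling events that pairwise satisfy the summary condition so each group
    # can be executed as a single, merged fault handling event.
    merged = []
    for event in pending_events:
        for group in merged:
            if summary_condition(group[0], event):
                group.append(event)
                break
        else:
            merged.append([event])
    return merged

def same_module_and_action(a, b):
    # Example summary condition (assumption): events target the same module with the same action.
    return a["module"] == b["module"] and a["action"] == b["action"]
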
9. A fault handling apparatus comprising:
an operation data acquisition unit configured to acquire operation data, the operation data comprising data of a name service from a first module, wherein the name service is a service function used for describing an instance of a module, is capable of providing service-name-based resolution for addressing between large-scale computer programs, and is capable of returning state data of instances under the module, and wherein the name service is used as a fault monitoring indicator;
a fault event determination unit configured to determine a fault handling event in response to determining, based on the operation data, that a first instance of the first module has failed, wherein the fault handling event comprises at least one of: repairing the instance, directly migrating the instance, and stopping service and then migrating the instance; and
a fault handling unit configured to perform the fault handling event on the first instance.
10. The apparatus of claim 9, wherein the fault event determination unit comprises:
means for determining, in response to data from the name service of the first module indicating that a most recent state of the first instance is an abnormal state, a duration of the abnormal state; and
means for determining that the first instance has failed in response to the duration exceeding a first threshold.
11. The apparatus of claim 9, wherein the fault event determination unit comprises:
means for determining that the first instance has failed in response to determining that the first module further includes at least one second instance and that the data from the name service of the first module indicates that a difference between a state value of the first instance and a state value of the at least one second instance is greater than a second threshold.
12. The apparatus of any of claims 9-11, wherein the operation data further comprises error-reporting data from a third instance of a second module, and wherein the fault event determination unit comprises:
means for determining that the first instance has failed in response to determining, based on network topology data, that the error-reporting data from the third instance indicates an anomaly associated with the first instance.
13. The apparatus of any of claims 9-11, wherein the operation data comprises error-reporting data from a plurality of instances, and wherein the fault event determination unit comprises:
means for determining that the first instance has failed in response to an anomaly associated with the first instance satisfying at least one of the following conditions: being reported by a majority of upstream instances, and being confirmed by a voting decision.
14. The apparatus of any of claims 9-11, wherein the operation data further comprises index data of the first instance, and wherein the fault event determination unit comprises:
means for determining that the first instance has failed in response to the index data of the first instance being above a third threshold.
15. The apparatus of any of claims 9-11, wherein the operation data acquisition unit comprises:
means for collecting first operation data from at least one module including the first module;
means for acquiring previously acquired second operation data in response to a value of the first operation data meeting a preliminary fault judgment condition, wherein a value of the second operation data does not meet the preliminary fault judgment condition; and
means for using the first operation data and the second operation data as the operation data.
16. The apparatus of any of claims 9-11, wherein the apparatus further comprises:
means for determining another fault handling event for another instance different from the first instance; and
means for merging the fault handling event with the other fault handling event, before the fault handling event is performed on the first instance, in response to determining that the fault handling event and the other fault handling event meet a summary condition.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202110937028.8A 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium Active CN113656207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937028.8A CN113656207B (en) 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113656207A (en) 2021-11-16
CN113656207B (en) 2023-11-03

Family

ID=78491655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937028.8A Active CN113656207B (en) 2021-08-16 2021-08-16 Fault processing method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113656207B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11243830B2 (en) * 2020-03-25 2022-02-08 Atlassian Pty Ltd. Incident detection and management

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101167057A (en) * 2005-04-29 2008-04-23 Fat Spaniel Technologies, Inc. Computer implemented systems and methods for pre-emptive service and improved use of service resources
CN106021005A (en) * 2016-05-10 2016-10-12 Beijing Kingsoft Security Software Co., Ltd. Method and device for providing application service and electronic equipment
CN110457151A (en) * 2019-07-10 2019-11-15 Wuba Co., Ltd. Hot repair method, device and readable storage medium
CN111026433A (en) * 2019-12-23 2020-04-17 National University of Defense Technology Method, system and medium for automatically repairing software code quality problem based on code change history
CN112583931A (en) * 2020-12-25 2021-03-30 Beijing Baidu Netcom Science and Technology Co., Ltd. Message processing method, message middleware, electronic device and storage medium
CN113010313A (en) * 2021-03-16 2021-06-22 Shenzhen Tencent Network Information Technology Co., Ltd. Load balancing method and device, electronic equipment and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Networked Monitoring and Fault Diagnosis; Wu Jinpei; Journal of Vibration, Measurement & Diagnosis (Issue 02); full text *

Also Published As

Publication number Publication date
CN113656207A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
EP3889777A1 (en) System and method for automating fault detection in multi-tenant environments
US9652316B2 (en) Preventing and servicing system errors with event pattern correlation
US11537940B2 (en) Systems and methods for unsupervised anomaly detection using non-parametric tolerance intervals over a sliding window of t-digests
CN108845910A (en) Monitoring method, device and the storage medium of extensive micro services system
EP3567496A1 (en) Systems and methods for indexing and searching
US11449798B2 (en) Automated problem detection for machine learning models
US20210097431A1 (en) Debugging and profiling of machine learning model training
CN112631887A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and computer-readable storage medium
CN111309567A (en) Data processing method and device, database system, electronic equipment and storage medium
US10896073B1 (en) Actionability metric generation for events
Pourmajidi et al. On challenges of cloud monitoring
US11416321B2 (en) Component failure prediction
US10372572B1 (en) Prediction model testing framework
CN115033463B (en) System exception type determining method, device, equipment and storage medium
US20220222266A1 (en) Monitoring and alerting platform for extract, transform, and load jobs
CN115640107A (en) Operation maintenance method, device, equipment and medium
US20210097432A1 (en) Gpu code injection to summarize machine learning training data
US11599404B2 (en) Correlation-based multi-source problem diagnosis
CN115904860A (en) Micro-service detection method, device, equipment and storage medium
CN113656252A (en) Fault positioning method and device, electronic equipment and storage medium
CN113656207B (en) Fault processing method, device, electronic equipment and medium
CN114116128B (en) Container instance fault diagnosis method, device, equipment and storage medium
CN114706893A (en) Fault detection method, device, equipment and storage medium
US20140122692A1 (en) Diagnostics information extraction from the db signals with measureless parameters
EP3832985B1 (en) Method and apparatus for processing local hot spot, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant