CN109559583B

CN109559583B - Fault simulation method and device

Info

Publication number: CN109559583B
Application number: CN201710888636.8A
Authority: CN
Inventors: 蔡俊杰; 童春杰; 王朝金
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-09-27
Filing date: 2017-09-27
Publication date: 2022-04-05
Anticipated expiration: 2037-09-27
Also published as: CN109559583A

Abstract

The embodiments of the application provide a fault simulation method and a device thereof. The method comprises the following steps: acquiring service statistical data from the services provided by a distributed system in the running state; analyzing the service statistical data and generating fault logic according to the analysis result, wherein the fault logic is used for simulating a fault; and distributing the fault logic to a simulation node, which simulates the fault according to the fault logic. The fault simulation method and device can improve the efficiency of fault simulation and make the fault simulation result fit actual scenarios more closely.

Description

Fault simulation method and device

Technical Field

The embodiment of the application relates to the technical field of distributed systems, in particular to a fault simulation method and a fault simulation device.

Background

A distributed system is a computer system consisting of multiple interconnected processing resources that cooperate to perform the same task under the control of the entire system, relying minimally on centralized programs, data, or hardware. In a large distributed system, the failure of software and hardware is inevitable.

With the widespread adoption of microservice architectures in fully cloud-based environments, system complexity increases further. For example, the fault space for mutual calls among 100 microservices has a size of 2 to the power of 100, and even if the reliability of each microservice is 99.9%, the reliability of a function that depends on 100 microservices is only about 90.5%. Fault injection is a reliability verification technique that deliberately introduces faults into a system through controlled experiments and observes how the system behaves in the presence of those faults; in other words, fault injection is a way of simulating faults. It is an important means of verifying the high availability of large distributed cloud systems. The injected fault types generally include resource-class faults and service-class faults. The emphasis is on service-class faults, because a resource-class fault on a node ultimately manifests as a failure of the service provided by that node. Simulating service-class faults can therefore more effectively uncover potential problems in a large-scale distributed system, and how to inject faults efficiently is a common problem in the industry.
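The two figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that each of the 100 microservices is either healthy or faulty (giving 2 to the power of 100 combinations) and that services fail independently, so their reliabilities multiply; the variable names are illustrative only.

```python
# Assumption: each service is in one of two states (healthy or faulty)
# and services fail independently of one another.
num_services = 100
per_service_reliability = 0.999

# Number of distinct healthy/faulty combinations across all services.
fault_space_size = 2 ** num_services

# A function that depends on every service works only if all of them work.
overall_reliability = per_service_reliability ** num_services

print(f"fault space size: {fault_space_size:.3e}")
print(f"overall reliability: {overall_reliability:.1%}")
```

The second value comes out at roughly 90.5%, matching the figure quoted above.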

Currently, the main fault injection method in the industry is lineage-driven fault injection (LDFI). The LDFI method works backward from the call relationships among services during normal operation of the distributed system to infer possible fault injection points, that is, points where injecting a fault may cause a system abnormality. However, because the LDFI method analyzes possible fault injection points based only on the call relationships between the services in the distributed system, fault simulation is inefficient.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present application is to provide a fault simulation method and a device thereof, which can improve the efficiency of fault simulation and make the fault simulation result more fit to the actual scene.

A first aspect of an embodiment of the present application provides a fault simulation method, including:

step 1: acquiring service statistical data from the service provided by the distributed system in the running state;

step 2: analyzing the service statistical data, and generating a fault logic according to an analysis result, wherein the fault logic is used for simulating a fault;

step 3: distributing the fault logic to the simulation nodes, and simulating the fault by the simulation nodes according to the fault logic.

In a second aspect, a fault simulation apparatus is provided, which includes a processing unit and a transceiver unit,

the processing unit is used for acquiring service statistical data from the service provided by the distributed system in the running state;

the processing unit is also used for analyzing the service statistical data and generating fault logic according to the analysis result, and the fault logic is used for simulating faults;

and the transceiving unit is used for distributing the fault logic to the simulation node, and the simulation node simulates the fault according to the fault logic.

A third aspect of the embodiments of the present application provides a fault simulation apparatus, including at least one processing element and at least one memory element, where the at least one memory element is used to store programs and data, and the at least one processing element is used to execute the method provided by the first aspect of the embodiments of the present application.

A fourth aspect of embodiments of the present application provides a fault simulation apparatus comprising at least one processing element (or chip) for performing the method of the first aspect above.

A fifth aspect of embodiments of the present application provides a program, which when executed by a processor is configured to perform the method of the first aspect.

A sixth aspect of embodiments herein provides a program product, e.g., a computer readable storage medium, embodying the program of the fifth aspect.

Therefore, in each aspect, the fault logic is generated by analyzing the service statistical data provided by the distributed system and is distributed to the simulation nodes, and the simulation nodes simulate the fault according to the fault logic, so that the fault simulation efficiency can be improved.

In a possible implementation manner, the fault logic includes a fault injection point, a fault injection time and a fault type, and the fault logic is configured to instruct the simulation node to inject a fault corresponding to the fault type for the fault injection point at the fault injection time, so that the simulation node simulates the fault according to the fault logic.

The fault injection point may be a simulation node, or a component or a function within a simulation node. It indicates where the fault is injected, that is, which component, node, or function in a node simulates the fault.

The fault injection time is used for indicating the time for injecting the fault into the fault injection point, and can be point time or segment time.

The fault type is used to distinguish between different faults and to indicate which fault is injected.
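As a minimal sketch, the three elements of the fault logic described above could be carried in a small record. The class name `FaultLogic`, its field names, the string encoding of the injection point, and the rule used for a point time are all assumptions made for illustration; the application does not prescribe a data format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FaultLogic:
    """Illustrative container for injection point, injection time, and fault type."""
    injection_point: str            # node, component, or function (hypothetical encoding)
    fault_type: str                 # e.g. "delay" or "exception"
    start: datetime                 # injection time as a point in time
    end: Optional[datetime] = None  # set only when the injection time is a period

    def is_due(self, now: datetime) -> bool:
        """Return True when a fault should be injected at `now`.

        Assumption: a point time is due once `now` has reached it,
        while a period is due only between `start` and `end`.
        """
        if self.end is None:
            return now >= self.start
        return self.start <= now <= self.end


logic = FaultLogic(
    injection_point="node-a/pay",
    fault_type="delay",
    start=datetime(2017, 9, 27, 20, 0),
    end=datetime(2017, 9, 27, 21, 0),
)
print(logic.is_due(datetime(2017, 9, 27, 20, 30)))
```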

In a possible implementation manner, the service statistical data may include traffic statistical data, call statistical information, system error information, and the like.

In a possible implementation manner, before the simulation node simulates a fault according to the fault logic, a fault request determination rule is sent to the simulation node. The simulation node uses the rule to determine whether an intercepted request is a fault request and, if it is, injects the fault corresponding to the fault type into the fault injection point at the fault injection time to simulate the fault.
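One way to picture the fault request determination rule is as a predicate applied to each intercepted request. The sketch below is hypothetical: the application does not say what a rule looks like, so matching on a request header, the `FaultRequestRule` class, and the `handle` function are assumptions chosen only to illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Request = Dict[str, str]  # hypothetical request representation: header name -> value


@dataclass
class FaultRequestRule:
    """Hypothetical rule: a request is a fault request if one header matches."""
    header: str
    value: str

    def is_fault_request(self, request: Request) -> bool:
        return request.get(self.header) == self.value


def handle(
    request: Request,
    rule: FaultRequestRule,
    inject_fault: Callable[[Request], str],
    serve: Callable[[Request], str],
) -> str:
    """Inject a fault only for requests the rule marks as fault requests."""
    if rule.is_fault_request(request):
        return inject_fault(request)
    return serve(request)


rule = FaultRequestRule(header="x-fault-drill", value="1")
print(handle({"x-fault-drill": "1"}, rule, lambda r: "fault injected", lambda r: "served"))
```

Requests that the rule does not mark are served normally, so a drill could run alongside ordinary traffic.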

In a possible implementation manner, the specific process of analyzing the service statistical data and generating the fault logic according to the analysis result may be to analyze the service statistical data by using a preset data analysis model and generate the fault logic according to the analysis result. The preset data analysis model defines how to process the service statistical data, and different data analysis models can be corresponding to different types of service statistical data. The preset data analysis model is combined with the service statistical data for calculation and analysis, an intelligent fault injection scheme, namely fault logic, can be automatically generated, and the intelligence of the fault simulation device can be improved.

In one possible implementation, the process of distributing the fault logic to the simulation nodes may be to distribute the fault logic to all simulation nodes with fault simulation capability, and the simulation nodes simulate the fault according to the fault logic under the condition that the fault injection point is confirmed to be matched with the simulation nodes.

In a possible implementation manner, the process of distributing the fault logic to the simulation nodes may be to distribute the fault logic to the simulation nodes corresponding to the fault injection points, that is, to distribute the fault logic in a targeted manner, so that the fault simulation is targeted.

In a possible implementation manner, after the simulation node simulates a fault according to the fault logic, the simulation node may destroy the fault instance corresponding to the fault logic. That is, the simulation node supports hot plugging: after a fault has been simulated, the corresponding fault instance is destroyed, and hot plugging enables the simulation node to dynamically obtain new fault capabilities without restarting.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.

Fig. 1 is a schematic system architecture diagram of a fault simulation system according to an embodiment of the present application;

fig. 2 is a schematic flowchart of a fault simulation method according to an embodiment of the present disclosure;

fig. 3 is a schematic flow chart of another fault simulation method provided in the embodiment of the present application;

fig. 4 is a schematic structural diagram of a fault execution module according to an embodiment of the present application;

fig. 5 is a schematic diagram of a logic structure of a fault simulation apparatus according to an embodiment of the present application;

fig. 6 is a simplified schematic diagram of an entity structure of a fault simulation apparatus according to an embodiment of the present application.

Detailed Description

The following describes a fault simulation method and a fault simulation apparatus provided in the embodiments of the present application with reference to the accompanying drawings.

Referring to fig. 1, a system architecture of a fault simulation system according to an embodiment of the present disclosure is shown. The system architecture diagram mainly includes a service module 100, a fault injection engine 200, and a fault execution module 300.

The service module 100 is a service provided when the distributed system is in an operating state. The services included by the service module 100 may vary depending on the functionality and implementation of each distributed system. Typically, a large distributed system will build several basic services: traffic statistics services, services that monitor Remote Procedure Calls (RPCs), log information (log) monitoring and collection services, and so on. Accordingly, the service modules 100 may include a traffic statistics service 110, an RPC call monitoring service 120, a log collection service 130, and other services. It should be noted that the types and the number of the services shown in fig. 1 are for example and do not constitute a limitation to the embodiments of the present application.

The traffic statistics service 110 may be used to provide traffic information, which may include traffic statistics for a distributed system over a period of time, e.g., providing traffic statistics for a week or a day, in order to analyze the period of a peak in access or a trough in access during the day. Analyzing traffic information provided by the traffic statistics service 110, time periods in which there is a peak or valley in the amount of access or other situations can be obtained.

The RPC call monitoring service 120 may be used to provide RPC call statistics that provide call information for each service over different time periods, including the number of requests called and the time spent calling. As the number of invocation requests increases, the latency of service invocation may also increase. The RPC call statistical information provided by the RPC call monitoring service 120 is analyzed, and the call delay interval of the service can be obtained. The RPC call monitoring service 120 may also provide a list of frequently faulty services, analyze the list of frequently faulty services, and obtain frequently faulty services or nodes or components.

The log collection service 130 may be used to provide system error information or system anomaly information for the distributed system, as well as the location of the system error information or system anomaly information in the distributed system, i.e., the function in the service or component or node that generated the error information or anomaly information. Analyzing the error or exception information provided by the log collection service 130 may obtain the services or components or nodes or functions in the nodes that are in error or exception.

Therein, fault injection engine 200 is used to analyze data, generate faults, and inject faults. The fault injection engine 200 obtains service statistical data of various services from the service module 100, analyzes the service statistical data, generates a fault according to an analysis result, and finally injects the generated fault into a service or a component or a node or a function in the node.

The fault execution module 300 has a fault execution capability and supports dynamically plugging and unplugging faults, so that it can simulate faults dynamically and make newly added faults available for drills without interrupting service. The fault execution module 300 may be integrated as a component by a service or node, at compile time or at run time, to support fault injection, for example integrated in the services or nodes shown as hexagons in fig. 1. The fault execution module 300 may also run on the target node as a separate process rather than being tightly integrated with the services on the node, or it may be dynamically injected by the fault injection engine or a third-party component into the processes running on each node of the distributed system.

The system architecture diagram may further include a terminal device, which may be the smartphone or notebook computer shown in fig. 1, or may be a computer, a tablet computer, a palmtop computer, a mobile internet device (MID), a wearable device, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, and the like.

When the terminal device communicates or performs internet activity, one or more nodes in the distributed system may participate, so that each service of the distributed system may generate corresponding service data and, in turn, corresponding service statistical data. For example, when the terminal device accesses a web page, the traffic statistics service 110 may generate traffic data and may produce traffic statistics for a period of time.

It should be noted that the names of the service module 100, the fault injection engine 200, and the fault execution module 300 are not limited to the embodiments of the present application, and other words may be used for description in practical applications.

The terms or names referred to in the embodiments of the present application will be explained below:

Fault (fault): an unexpected or unacceptable internal state that arises while the system is running. If a fault occurs in software and is not handled by proper error (fault-tolerant) processing, the software fails.

Failure (failure): an unexpected or unacceptable external behavior that results when the system is running.

Error (error): an undesired or unacceptable human mistake made during the life cycle, which may lead to the creation of a defect.

Defect (defect): an undesired or unacceptable deviation that exists in the system. A system operating under a particular condition may activate a defect, resulting in a fault.

In the embodiment of the present application, the faults, failures, errors, and defects are collectively described as faults, which may affect the operation of the system.

Fault mode: a normative description of a fault, describing how the fault manifests in a functional object. The description should be concise and, following certain principles, use a subject-predicate structure (for example, "timer delay" or "database deadlock").

Fault injection object: the functional object under test that produces a fault when a fault mode is applied. In the embodiment of the present application, a fault injection object is described as a fault injection point.

Fault injection time: the time at which the fault is injected into the fault injection point, which may be a point in time or a period of time.

Fault execution time: the time at which the fault is executed at the fault injection point.

Fault type: used for distinguishing different faults.

The fault simulation method provided by the embodiment of the present application will be described in detail below with reference to fig. 2 and 3.

Referring to fig. 2, a schematic flow chart of a fault simulation method according to an embodiment of the present application is provided, where the method may include, but is not limited to, the following steps:

step S201: service statistics are obtained from services provided by the distributed system in a running state.

The fault injection engine 200 obtains the corresponding service statistical data from each service provided by the running distributed system. For example, it obtains traffic statistics from the traffic statistics service 110 shown in fig. 1, RPC call statistics from the RPC call monitoring service 120 shown in fig. 1, and error or exception information from the log collection service 130 shown in fig. 1.

It is to be understood that the service statistical data includes the service statistics provided by each of at least one service, that is, it includes at least one set of service statistics.

In one possible implementation, in a case where the distributed system is in a running state, each service included in the service module 100 periodically or aperiodically sends each generated service statistic to the fault injection engine 200, for example, a certain service sends each generated service statistic to the fault injection engine 200 every week. The fault injection engine 200 receives service statistics sent by the services.

In a possible implementation manner, the fault injection engine 200 sends a data acquisition instruction to a service included in the service module 100, and on receiving the instruction, that service provides the corresponding service statistical data to the fault injection engine 200. The fault injection engine 200 may send data acquisition instructions when a fault needs to be simulated. The instructions may be sent in a targeted manner: the fault injection engine 200 determines which type of service statistical data is needed and sends the data acquisition instruction to the service corresponding to that type.

In a possible implementation manner, the service statistics data generated by each service included in the service module 100 when the distributed system is in the running state is stored in a memory of the fault simulation system, and the fault injection engine 200 extracts the required service statistics data from the memory when the fault needs to be simulated.

In a possible implementation manner, the fault injection engine 200 may directly call an Application Program Interface (API) of a corresponding service to obtain service statistical data corresponding to the service in the running state of the distributed system, where a data obtaining instruction is not needed. An application program interface is a set of definitions, procedures and protocols that enable the intercommunication of computer software through an API. One of the primary functions of an API is to provide a common set of functions. The API is also a middleware and provides data sharing for various platforms.
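The acquisition options described above (services pushing statistics, and the engine pulling them on demand) can be sketched with one small collector. This is a hypothetical illustration: `StatsCollector`, its method names, and the dictionary shape of the statistics are assumptions, and the `fetch` callable merely stands in for a data acquisition instruction or an API call.

```python
from typing import Callable, Dict, List

Stats = Dict[str, object]  # hypothetical shape of one batch of service statistics


class StatsCollector:
    """Illustrative collector supporting both push and pull acquisition."""

    def __init__(self) -> None:
        self._pull_sources: Dict[str, Callable[[], Stats]] = {}
        self._received: Dict[str, List[Stats]] = {}

    def register_source(self, kind: str, fetch: Callable[[], Stats]) -> None:
        # Pull mode: `fetch` stands in for an acquisition instruction or API call.
        self._pull_sources[kind] = fetch

    def receive(self, kind: str, stats: Stats) -> None:
        # Push mode: a service reports statistics periodically or aperiodically.
        self._received.setdefault(kind, []).append(stats)

    def acquire(self, kind: str) -> List[Stats]:
        # Targeted acquisition: only the requested type of statistics is fetched.
        pulled = [self._pull_sources[kind]()] if kind in self._pull_sources else []
        return self._received.get(kind, []) + pulled


collector = StatsCollector()
collector.receive("traffic", {"day": 1})
collector.register_source("traffic", lambda: {"day": 2})
print(collector.acquire("traffic"))
```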

Step S202: analyzing the service statistical data and generating fault logic according to the analysis result.

The fault injection engine 200 analyzes the acquired service statistical data and generates fault logic according to the analysis result. The fault logic is used to simulate a fault.

The fault logic comprises a fault injection point, fault injection time and a fault type and is used for indicating a simulation node to inject a fault corresponding to the fault type aiming at the fault injection point at the fault injection time so as to enable the simulation node to simulate the fault. Generally, the fault in the actual scene includes a fault occurrence point, a fault occurrence time and a fault type, so that the generated fault logic is relatively practical.

It should be noted that, if the fault injection point executes the fault when injecting the fault corresponding to the fault type, the fault injection time and the fault occurrence time may be approximately the same, so that the fault logic is more practical.

It is to be understood that the fault injection point, the fault injection time, and the fault type are obtained by the fault injection engine 200 by analyzing the service statistical data. The fault injection point indicates where the fault is injected, namely which component, node, or function in a node will execute (that is, simulate) the fault. The fault injection time indicates when the fault is injected, and the fault type indicates which type of fault is injected.

The fault injection time may be obtained by analyzing the service statistical data, and may also be a default injection time of the fault simulation system or the fault injection engine 200.

In one example, the fault injection engine 200 obtains access statistics of the entire distributed system from the traffic statistics service 110 shown in fig. 1 over a certain period of time, for example, access traffic statistics of a certain day, and analyzes a period of peak access traffic of the day from the access traffic statistics of the day, and the fault injection engine 200 may use this period as the fault injection time, and instruct the simulation node to inject a fault in this period of time, so as to simulate the behavior of the distributed system in a scenario with a large access amount, and may check the reliability of the distributed system.
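A minimal sketch of this example follows, assuming the traffic statistics arrive as request counts per hour of one day. The function name `peak_window`, the hourly granularity, and the sample numbers are illustrative assumptions, not part of the application.

```python
from typing import Dict, Tuple


def peak_window(hourly_requests: Dict[int, int], width: int = 1) -> Tuple[int, int]:
    """Return (start_hour, end_hour) of the busiest contiguous window of `width` hours.

    Hours missing from the input count as zero requests; on a tie the
    earliest window is kept.
    """
    best_start, best_total = 0, -1
    for start in range(0, 24 - width + 1):
        total = sum(hourly_requests.get(hour, 0) for hour in range(start, start + width))
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_start + width


# Hypothetical access counts for one day, keyed by hour.
traffic = {9: 120, 10: 300, 20: 900, 21: 850}
start_hour, end_hour = peak_window(traffic, width=2)
print(f"inject faults between {start_hour}:00 and {end_hour}:00")
```

The returned window could then be used as the fault injection time (a period rather than a point).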

In one example, the fault injection engine 200 obtains, from the RPC call monitoring service 120 shown in fig. 1, call statistics information of each service in different time periods when the distributed system is in a running state, analyzes a call delay interval of each service according to the call statistics information, and the fault injection engine 200 may generate a fault type according to the call delay interval of a certain service, where the fault type may be a delay fault, and a delay duration of the delay fault may be determined by the call delay interval of the service, so as to simulate an influence of the delay fault on the service and an influence on the distributed system.
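The delay example can be sketched as follows, assuming the call statistics are a list of observed latencies in milliseconds. Taking the interval as (minimum, maximum) and scaling the upper bound by a `factor` to get the injected delay are assumptions; the application only says that the delay duration may be determined from the call delay interval.

```python
from typing import Dict, List, Tuple


def latency_interval(samples_ms: List[float]) -> Tuple[float, float]:
    """Return the observed call delay interval as (lowest, highest) latency."""
    return min(samples_ms), max(samples_ms)


def delay_fault(service: str, samples_ms: List[float], factor: float = 2.0) -> Dict[str, object]:
    """Build a delay fault whose duration exceeds the observed latencies.

    Assumption: the injected delay is the upper bound of the interval
    multiplied by `factor`, to simulate calls slower than any seen so far.
    """
    _, high = latency_interval(samples_ms)
    return {"fault_type": "delay", "injection_point": service, "delay_ms": high * factor}


print(delay_fault("inventory", [12.0, 40.0, 25.0]))
```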

The fault injection engine 200 obtains the system error information of the distributed system in the running state from the log collection service 130 shown in fig. 1, and dynamically generates a fault simulating the abnormal type according to the system error information, so that the fault simulation is closer to a real scene.

In one example, the fault injection engine 200 obtains a fault service list of the distributed system in a running state from the RPC call monitoring service 120 shown in fig. 1, selects a critical service or node to be covered by fault injection according to the fault service list, and uses the selected service or node as the fault injection point, so as to simulate a fault at the fault injection point.

The fault injection engine 200 obtains, from the log collection service 130 shown in fig. 1, the locations of system error information or system exception information of the distributed system in the running state, and takes the points corresponding to those locations as fault injection points, so as to simulate faults at them. For example, the log may record that a service frequently throws a certain type of exception and may identify which method of the service produced it. If the service is an error-prone weak link in the distributed system, the fault injection engine 200 can purposefully inject a fault into that method to simulate the type of exception recorded in the log, thereby verifying the ability of the whole distributed system to handle the fault.
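The log-based example can be illustrated by counting how often each (service, method, exception) combination appears and keeping the frequent ones as injection points. The record layout, the `min_count` threshold, and the function name are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed shape of one parsed log record: (service, method, exception type).
LogRecord = Tuple[str, str, str]


def injection_points_from_logs(records: Iterable[LogRecord], min_count: int = 2) -> List[Dict[str, str]]:
    """Turn frequently logged exceptions into candidate fault injection points."""
    counts = Counter(records)
    return [
        {"injection_point": f"{service}.{method}", "fault_type": exception}
        for (service, method, exception), seen in counts.most_common()
        if seen >= min_count
    ]


records = [("order", "pay", "TimeoutError")] * 3 + [("user", "login", "KeyError")]
print(injection_points_from_logs(records))
```

Here the method that logged the same exception three times becomes an injection point, and the exception type doubles as the fault type to simulate.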

The above examples are for illustration, and the practical application may not be limited to the above examples.

In one approach, the fault injection engine 200 analyzes the service statistics using a data analysis model and generates fault logic based on the analysis. Wherein the data analysis model is a preset model or rule of the fault injection engine 200, and defines how to process the service statistical data obtained from the service module 100. The data analysis model includes a plurality of analysis models corresponding to different types of service statistics, for example, one analysis model corresponding to service statistics provided by the traffic statistics service 110 and another analysis model corresponding to service statistics provided by the log collection service 130. The analysis model can convert the acquired service statistics into fault injection related information, such as fault injection points, fault injection time, fault types, and the like.

It is understood that the fault injection engine 200 combines the service statistics with the data analysis model to perform a computational analysis, and automatically generates an intelligent fault injection scheme indicating a fault injection point, a fault injection time, a fault type, and the like. The fault injection scheme is fault logic.
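The idea of one analysis model per type of service statistical data can be sketched as a lookup table of functions, each contributing part of the fault logic. The model functions, the keys "traffic", "rpc", and "log", and the input shapes are all hypothetical; real models would be considerably richer.

```python
from typing import Callable, Dict

# Each analysis model maps one kind of statistics to part of the fault logic.
AnalysisModel = Callable[[dict], dict]


def traffic_model(stats: dict) -> dict:
    # Pick the time slot with the most requests as the injection time.
    return {"injection_time": max(stats, key=stats.get)}


def rpc_model(stats: dict) -> dict:
    # Derive a delay fault from the slowest observed call.
    return {"fault_type": "delay", "delay_ms": max(stats["latency_ms"])}


def log_model(stats: dict) -> dict:
    # Use the logged error location as the injection point.
    return {"injection_point": stats["location"]}


MODELS: Dict[str, AnalysisModel] = {"traffic": traffic_model, "rpc": rpc_model, "log": log_model}


def generate_fault_logic(stats_by_kind: Dict[str, dict]) -> dict:
    """Combine the output of every applicable analysis model into one fault logic."""
    logic: dict = {}
    for kind, stats in stats_by_kind.items():
        model = MODELS.get(kind)
        if model is not None:
            logic.update(model(stats))
    return logic


logic = generate_fault_logic({
    "traffic": {"20:00": 900, "03:00": 40},
    "rpc": {"latency_ms": [12, 40]},
    "log": {"location": "order.pay"},
})
print(logic)
```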

In one example, the fault injection engine 200 combines the fault service list provided by the RPC call monitoring service 120 with the corresponding analysis model, performs calculation and analysis, and generates fault injection points. Alternatively, the fault injection engine 200 combines the system error information or system exception information provided by the log collection service 130 with the corresponding analysis model, performs calculation and analysis, and generates a fault injection point.

In one example, the fault injection engine 200 combines the traffic statistics provided by the traffic statistics service 110 with the corresponding analysis model, performs calculation and analysis, and generates a fault injection time. For example, to test whether a distributed system copes well when some services have relatively large delays in a high-access, high-concurrency scenario, and thereby exercise the reliability of the system under heavy access, a fault needs to be injected during a peak period. The traffic statistics can serve as one piece of reference information for selecting the peak period of system traffic.

In one example, the fault injection engine 200 combines the RPC call statistics provided by the RPC call monitoring service 120 and its corresponding analysis model, calculates and analyzes, and generates a delay fault to simulate a problem that may occur when the service call is further delayed. Or the fault injection engine 200 combines the system error information provided by the log collection service 130 and the corresponding analysis model thereof, and performs calculation analysis to generate a fault type, so that the fault simulation is closer to a real scene.

The fault injection engine 200 generates fault logic according to the service statistical data, so that the generated fault logic fits an actual scene better.

Some commonly used fault types may be preset in the fault injection engine 200, and in the case that the fault type cannot be generated, the preset fault types may be adopted to generate fault logic.

Some fault logic may be preset in fault injection engine 200 and may be used to inject into the simulation node in the event that the service statistics provided by service module 100 are not complete or sufficient to generate fault logic.
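The two fallbacks described above (a preset fault type when none can be generated, and a preset fault logic when the statistics are insufficient) might be combined as in the following sketch. The preset values, the field names, and the function `finalize` are hypothetical.

```python
from typing import Optional

# Hypothetical presets held by the fault injection engine.
PRESET_FAULT_TYPES = ["delay", "exception"]
PRESET_FAULT_LOGIC = {"injection_point": "gateway", "injection_time": "02:00", "fault_type": "delay"}
REQUIRED_FIELDS = ("injection_point", "injection_time", "fault_type")


def finalize(generated: Optional[dict]) -> dict:
    """Fill gaps in generated fault logic using the presets.

    A missing fault type is replaced by a preset type; if other fields are
    still missing, the whole preset fault logic is used instead.
    """
    logic = dict(generated or {})
    logic.setdefault("fault_type", PRESET_FAULT_TYPES[0])
    if any(field not in logic for field in REQUIRED_FIELDS):
        return dict(PRESET_FAULT_LOGIC)
    return logic


print(finalize({"injection_point": "order.pay", "injection_time": "20:00"}))
print(finalize(None))
```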

Step S203: distributing the fault logic to a simulation node, and simulating the fault by the simulation node according to the fault logic.

The fault injection engine 200 distributes the generated fault logic to the simulation nodes, which simulate the fault according to the fault logic.

In one possible implementation, the fault injection engine 200 may dynamically distribute the fault logic in the form of plug-ins to each component or node in the distributed system having a fault execution module 300. In other words, the simulation nodes are all components or nodes having the fault execution module 300, having fault simulation capability.

Taking one simulation node as an example: on receiving the fault logic, the simulation node may determine whether the fault injection point included in the fault logic matches the simulation node, that is, whether the fault injection point is the simulation node itself or belongs to it. If so, the simulation node injects the fault corresponding to the fault type into the fault injection point at the fault injection time, and the fault injection point executes the fault on receiving it, thereby simulating the fault. On receiving the fault logic, the simulation node may also make this determination in combination with a fault request determination rule, which is described in step S204.
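The broadcast style of distribution, in which every simulation node receives the fault logic and decides for itself whether the injection point matches, can be sketched as below. The `SimulationNode` class and the "node/function" encoding of the injection point are assumptions for illustration.

```python
from typing import Iterable, List


class SimulationNode:
    """Illustrative node that accepts fault logic only when the injection point matches."""

    def __init__(self, name: str, functions: Iterable[str]) -> None:
        self.name = name
        self.functions = set(functions)
        self.injected: List[dict] = []

    def matches(self, injection_point: str) -> bool:
        # Assumed encoding: "node" for the node itself, "node/function" for a function in it.
        node, _, function = injection_point.partition("/")
        return node == self.name and (not function or function in self.functions)

    def receive(self, logic: dict) -> bool:
        if not self.matches(logic["injection_point"]):
            return False  # the fault logic is ignored by non-matching nodes
        self.injected.append(logic)
        return True


def broadcast(logic: dict, nodes: Iterable[SimulationNode]) -> List[str]:
    """Send the fault logic to every node and return the names of nodes that accepted it."""
    return [node.name for node in nodes if node.receive(logic)]


nodes = [SimulationNode("node-a", ["pay"]), SimulationNode("node-b", ["login"])]
print(broadcast({"injection_point": "node-a/pay", "fault_type": "delay"}, nodes))
```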

In one possible implementation, the fault injection engine 200 may distribute the fault logic in the form of plug-ins to the simulation nodes corresponding to the fault injection points. In other words, the simulation node is the node or component corresponding to the fault injection point. In this case, all components or nodes may be considered by default to have the fault execution module 300 and therefore fault simulation capability.

If the fault injection point is a component or a node, the simulation node is the component or the node, and the component or the node injects the fault corresponding to the fault type at the fault injection time under the condition that the component or the node receives the fault logic, namely the fault is executed at the fault injection time so as to simulate the fault. If the fault injection point is a certain component or a certain function of a node, the simulation node is the component or the node, and the component or the node injects the fault corresponding to the fault type to the fault injection point at the fault injection time under the condition that the fault logic is received, so that the fault is executed by the fault injection point under the condition that the fault is received by the fault injection point to simulate the fault.
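The two distribution modes (to every node with fault simulation capability, or only to the node corresponding to the fault injection point) can be sketched as follows. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the convention that a function inside a node is written as `"node/function"` is invented here purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FaultLogic:
    injection_point: str  # assumed form: "node" or "node/function"
    injection_time: str
    fault_type: str


@dataclass
class SimulationNode:
    name: str
    received: list = field(default_factory=list)

    def matches(self, logic: FaultLogic) -> bool:
        # The injection point is this node itself or belongs to this node.
        point = logic.injection_point
        return point == self.name or point.startswith(self.name + "/")

    def receive(self, logic: FaultLogic) -> bool:
        """Keep the fault logic only if its injection point matches this node."""
        if self.matches(logic):
            self.received.append(logic)
            return True
        return False


def distribute_to_all(nodes, logic):
    """Broadcast mode: every node receives the logic and checks the match itself."""
    return [n.name for n in nodes if n.receive(logic)]


def distribute_targeted(nodes, logic):
    """Targeted mode: send only to the node that owns the injection point."""
    owner = logic.injection_point.split("/", 1)[0]
    return [n.name for n in nodes if n.name == owner and n.receive(logic)]
```

In broadcast mode the matching decision is made on each node; in targeted mode the engine resolves the owning node before sending, so non-matching nodes never see the fault logic.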

The fault logic supports hot plugging, so the simulation node can dynamically acquire new fault capabilities without restarting.

In the embodiment described in fig. 2, the fault injection engine obtains the service statistical data from the service provided by the distributed system in the running state, analyzes the service statistical data, generates the fault logic according to the analysis result, and distributes the fault logic to the simulation node, so that the simulation node simulates the fault according to the fault logic. This improves the efficiency of fault simulation. Because the fault logic is generated from the service statistical data, the fault simulation result better fits the actual scenario. In addition, the embodiment of the application requires no manual intervention, which can reduce research, development, and test investment and make the fault simulation system more intelligent.

In one possible implementation, the fault injection engine 200 further performs step S204 before performing step S203. Step S204 may be executed simultaneously with step S203, or may be executed after step S203.

Step S204: send a fault request determination rule to the simulation node.

The fault injection engine 200 sends a fault request determination rule to the simulation node, where the fault request determination rule is used for the simulation node to determine whether the intercepted request is a fault request.

The fault injection engine 200 may pre-configure or dynamically configure the fault request determination rule and send it to all services, components, or nodes that have the fault execution module 300, that is, to all simulation nodes. All simulation nodes may update the fault request determination rule synchronously with the fault injection engine 200, that is, whenever the fault injection engine 200 updates the rule, all simulation nodes update it as well.

In this case, the fault logic instructs the simulation node to inject the fault corresponding to the fault type into the fault injection point at the fault injection time, so as to simulate the fault, when the simulation node determines according to the fault request determination rule that the intercepted request is a fault request.

Typically, the fault request determination rule may include a time and the fault injection point. The fault execution module 300 in the simulation node combines the fault request determination rule with the fault injection point and the fault injection time in the fault logic to generate a complete set of fault determination rules, and uses this set to determine whether an intercepted request is a fault request. If the interception time of the intercepted request is consistent with or matches the time, and the request object of the request is the same as the fault injection point, the simulation node may determine that the intercepted request is a fault request. For example, if the time is 9:00 to 9:30 every morning, the fault injection point is node A, and the simulation node intercepts a request for node A at 9:00, the simulation node may determine that the request is a fault request. The fault request determination rule may further include a preset proportion, which instructs the fault injection point to select a preset proportion of requests for fault simulation, for example 10% of the requests. Simulating faults on only part of the requests allows fault simulation to proceed without affecting normal operation of the distributed system.
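The determination described above (time window, request object, and optional preset proportion) can be sketched as follows. This is a minimal sketch; `FaultRequestRule`, `is_fault_request`, and the use of random sampling to realize the preset proportion are assumptions, since the embodiment does not specify how the proportion is applied.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class FaultRequestRule:
    start: time               # start of the daily injection window
    end: time                 # end of the daily injection window
    injection_point: str      # request object that should be faulted
    proportion: float = 1.0   # fraction of matching requests to fault


def is_fault_request(rule, intercepted_at, request_target, rng=random):
    """Return True if the intercepted request should be treated as a fault request."""
    in_window = rule.start <= intercepted_at.time() <= rule.end
    same_target = request_target == rule.injection_point
    if not (in_window and same_target):
        return False
    # Assumed mechanism: sample so that only the preset proportion is faulted.
    return rng.random() < rule.proportion
```

For the example in the text, a rule with the window 9:00 to 9:30 and injection point node A marks a request for node A intercepted at 9:00 as a fault request; with `proportion=0.1`, roughly one in ten matching requests is selected.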

Fig. 3 is a schematic flow chart of another fault simulation method provided in an embodiment of the present application. The method may include, but is not limited to, the following steps:

Step S301: receive fault logic.

The simulation node receives the fault logic distributed by the fault injection engine 200. For the description of the fault logic, refer to the description given for fig. 2, which is not repeated here. One simulation node is taken as an example; the simulation node may or may not be the node corresponding to the fault injection point.

Step S302: simulate a fault according to the fault logic.

The simulation node determines a fault injection point, a fault injection time and a fault type according to the fault logic, and injects a fault corresponding to the fault type to the fault injection point at the fault injection time, so that the fault injection point executes the fault, specifically, the fault execution module 300 included in the fault injection point executes the fault.

When the simulation node receives the fault logic, it may determine, in combination with a fault request determination rule, whether the fault injection point included in the fault logic matches the simulation node. If the fault injection point matches the simulation node, the simulation node injects the fault corresponding to the fault type into the fault injection point at the fault injection time, so that the fault injection point executes the fault.

Referring to fig. 4, which is a schematic structural diagram of a fault execution module 300 according to an embodiment of the present disclosure, the fault execution module 300 includes a request interceptor 310, a fault logic directory 320, a fault loading module 330, a fault logic library 340, and a fault logic lifecycle management 350. It should be noted that the fault execution module 300 shown in fig. 4 does not constitute a limitation to the embodiment of the present application.

If the simulation node receives the fault request determination rule sent by the fault injection engine 200, the request interceptor 310 determines whether the intercepted request is a fault request. The request interceptor 310 is responsible for intercepting various types of requests, including a request for a node, a request for a function in a node, and the like, determining whether the request is a fault request, and injecting a fault if the request is determined to be a fault request.
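The role of the request interceptor can be sketched as a wrapper around a request handler. This is a hedged illustration only: the decorator `intercept`, the exception `InjectedFault`, the dictionary-shaped request, and the choice to simulate the fault by raising an exception are all assumptions and not part of the embodiment.

```python
import functools
from typing import Callable


class InjectedFault(Exception):
    """Raised instead of the normal response to simulate a service fault."""


def intercept(is_fault_request: Callable[[dict], bool], fault_type: str):
    """Wrap a handler so that every request is checked before being served."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request: dict):
            # Inject the fault only for requests judged to be fault requests.
            if is_fault_request(request):
                raise InjectedFault(f"{fault_type} injected for {request['target']}")
            return handler(request)
        return wrapper
    return decorator


# Hypothetical handler: requests for "node-a" are treated as fault requests.
@intercept(lambda req: req["target"] == "node-a", fault_type="error_response")
def handle(request: dict) -> str:
    return f"ok:{request['target']}"
```

Requests that are not judged to be fault requests pass through to the handler unchanged, which matches the intent that normal traffic is unaffected.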

Generally, the fault request determination rule may include a time and the fault injection point. The fault execution module 300 combines the fault request determination rule with the fault injection point and the fault injection time in the fault logic to generate a complete set of fault determination rules, and uses this set to determine whether an intercepted request is a fault request. If the interception time of the request intercepted by the request interceptor 310 is consistent with or matches the time, and the request object of the request is the same as the fault injection point, the request interceptor 310 may determine that the intercepted request is a fault request. For example, if the time is 9:00 to 9:30 every morning, the fault injection point is node A, and the simulation node intercepts a request for node A at 9:00, the request interceptor 310 may determine that the request is a fault request. The fault request determination rule may further include a preset proportion, which instructs the fault injection point to select a preset proportion of requests for fault simulation, for example 10% of the requests. Simulating faults on only part of the requests allows fault simulation to proceed without affecting normal operation of the distributed system.

The fault logic directory 320 records information about all local fault logic, and may also record fault logic that exists only in the fault injection engine on the server side. When the request interceptor 310 recognizes a request as a fault request and needs to execute fault logic, it first consults the fault logic directory 320 to confirm that the fault logic exists in the local fault logic library 340. If the selected fault logic does not exist locally, the fault loading module 330 may dynamically download the fault logic from the fault injection engine 200. The downloaded fault logic may be saved to the fault logic library 340.

The fault logic lifecycle management 350 is responsible for initializing the fault logic, creating an instance of the logic, executing the fault logic to inject the fault, and destroying the fault instance after the fault execution is completed. The whole process from loading to executing the fault logic is completed dynamically, that is, in a hot-plug manner. Supporting a new fault type does not require modifying or restarting the fault execution module or its services and nodes.
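The interplay of the directory, the loading module, and the library can be sketched as a look-up with download on a miss. This is a minimal sketch; the class `FaultExecutionStore` and the injected `download` callable (standing in for fetching from the fault injection engine 200) are hypothetical.

```python
class FaultExecutionStore:
    """Sketch of directory 320, loading module 330, and library 340 together."""

    def __init__(self, download):
        self.directory = set()     # names of fault logic available locally
        self.library = {}          # name -> locally stored fault logic
        self._download = download  # assumed stand-in for the engine download

    def get(self, name):
        """Return fault logic by name, downloading it only if it is missing."""
        if name not in self.directory:
            logic = self._download(name)  # dynamic download on a miss
            self.library[name] = logic    # save to the local library
            self.directory.add(name)      # record it in the directory
        return self.library[name]
```

A second `get` for the same name is served from the local library, so the engine is contacted only once per fault logic.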
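The lifecycle (initialize, create an instance, execute, destroy) and the hot-plug loading can be sketched as follows. This is only an illustration under strong assumptions: the embodiment does not specify the plug-in format, so the sketch assumes the fault logic arrives as Python source text defining a class named `Fault`, and loads it at runtime with `exec`. A real system would have to verify the origin of such code before executing it.

```python
import types

# Hypothetical plug-in source as it might be delivered by the engine.
PLUGIN_SOURCE = '''
class Fault:
    def __init__(self, injection_point):
        self.injection_point = injection_point
        self.active = True

    def execute(self):
        return f"timeout injected at {self.injection_point}"

    def destroy(self):
        self.active = False
'''


def load_plugin(name, source):
    """Initialize fault logic at runtime, with no restart of the host process."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)  # assumption: source comes from a trusted engine
    return module


def run_fault(module, injection_point):
    """Create a fault instance, execute it, and always destroy it afterwards."""
    instance = module.Fault(injection_point)
    try:
        result = instance.execute()
    finally:
        instance.destroy()
    return result, instance.active
```

Because the plug-in is loaded into a fresh module object, a new fault type can be added by delivering new source text, without modifying the code that runs it.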

Step S303: destroy the fault instance corresponding to the fault logic.

After the fault injection point executes the fault logic, the fault logic lifecycle management 350 in the simulation node destroys the fault instance corresponding to the fault logic, so as to reduce the influence of the simulation fault on the simulation node.

In the embodiment described in fig. 3, the simulation node simulates the fault according to the fault logic when receiving the fault logic, so that the fault simulation result is more suitable for the actual scene.

The method of the embodiment of the present application is explained in detail above, and the apparatus provided by the embodiment of the present application, which corresponds to the fault injection engine 200 shown in fig. 1, will be described below.

Referring to fig. 5, which is a schematic diagram of a logic structure of a fault simulation apparatus according to an embodiment of the present disclosure, the apparatus 50 may include a processing unit 501 and a transceiver unit 502. The processing unit 501 may be configured to perform operations for controlling the fault simulation apparatus, for example, perform step S201 and step S202 in the embodiment shown in fig. 2, that is, obtain service statistical data from a service provided by the distributed system in the running state, analyze the service statistical data, and generate the fault logic according to the analysis result. The transceiver unit 502 may be configured to communicate with the simulation node, for example, perform step S203 in the embodiment shown in fig. 2.

The fault logic comprises a fault injection point, a fault injection time, and a fault type, and is used to instruct the simulation node to inject a fault corresponding to the fault type into the fault injection point at the fault injection time, so that the simulation node simulates the fault.

The transceiver unit 502 is further configured to execute step S204 in the embodiment shown in fig. 2 and send the fault request determination rule to the simulation node, so that the simulation node determines whether the intercepted request is a fault request. In this case, the fault logic is configured to instruct the simulation node to inject a fault corresponding to the fault type into the fault injection point at the fault injection time when it is determined according to the fault request determination rule that the intercepted request is the fault request, so that the simulation node simulates the fault.

When analyzing the service statistical data and generating the fault logic according to the analysis result, the processing unit 501 is specifically configured to analyze the service statistical data by using a preset data analysis model and generate the fault logic according to the analysis result.

When the transceiver unit 502 distributes the fault logic to the simulation nodes, the fault logic may be distributed to all simulation nodes with fault simulation capability, or may be distributed in a targeted manner to the simulation nodes corresponding to the fault injection points.

Referring to fig. 6, which is a simplified schematic diagram of a physical structure of a fault simulation apparatus according to an embodiment of the present disclosure, the apparatus 60 may include a transceiver 601, a processor 602, and a memory 603. The transceiver 601, processor 602 and memory 603 may be interconnected via a bus 604, or may be connected in other ways. The related functions implemented by the processing unit 501 shown in fig. 5 may be implemented by one or more processors 602. The related functions implemented by the transceiving unit 502 shown in fig. 5 may be implemented by the transceiver 601.

The memory 603 includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and the memory 603 is used for storing relevant program codes and data.

The transceiver 601 is used for transmitting data and/or signaling and for receiving data and/or signaling. In the embodiment of the present application, the transceiver 601 is used to send the fault logic and the fault request determination rule to the simulation node.

The processor 602 may include one or more processors, such as one or more Central Processing Units (CPUs), and in the case that the processor 602 is one CPU, the CPU may be a single-core CPU or a multi-core CPU. In the embodiment of the present application, the processor 602 is configured to execute step S201 and step S202 in the embodiment shown in fig. 2.

For the steps executed by the processor 602 and the transceiver 601, reference may be specifically made to the description of the embodiment shown in fig. 2, and details are not described here again.

It will be appreciated that fig. 6 shows only a simplified design of the fault simulation device. In practical applications, the fault simulation apparatus may also include necessary other components, including but not limited to any number of transceivers, processors, controllers, memories, communication units, etc., respectively, and all devices that can implement the present application are within the protection scope of the present application.

One of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks. Accordingly, a further embodiment of the present application provides a computer-readable storage medium having stored therein instructions, which, when executed on a computer, cause the computer to perform the method of the above aspects.

Yet another embodiment of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above aspects.

Those of ordinary skill in the art would appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions may be used in practice. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

Claims

1. A method of fault simulation, comprising:

acquiring service statistical data from a service provided by a distributed system in a running state, wherein the service statistical data comprises flow statistical data, calling statistical information and system error information;

analyzing the service statistical data, and generating a fault logic according to an analysis result, wherein the fault logic is used for simulating a fault;

distributing the fault logic to a simulation node, and simulating the fault by the simulation node according to the fault logic;

before simulating, by the simulation node, the fault according to the fault logic, the method further includes:

sending a fault request judgment rule to the simulation node, wherein the fault request judgment rule is used for the simulation node to judge whether the intercepted request is a fault request;

the fault request judgment rule comprises fault injection time and a fault injection point, and if the interception time of the intercepted request is consistent with or matched with the fault injection time and the request object of the request is the same as the fault injection point, the simulation node determines that the intercepted request is a fault request;

the fault logic comprises the fault injection point, the fault injection time and a fault type, and the fault logic is used for indicating the simulation node to inject a fault corresponding to the fault type at the fault injection time aiming at the fault injection point so as to simulate the fault under the condition that the intercepted request is judged to be the fault request according to the fault request judgment rule.

2. The method of claim 1, wherein analyzing the service statistics and generating fault logic based on the analysis comprises:

and analyzing the service statistical data by adopting a preset data analysis model, and generating fault logic according to an analysis result.

3. The method of claim 1, wherein distributing the fault logic to simulation nodes comprises:

distributing the fault logic to all simulation nodes with fault simulation capability;

or distributing the fault logic to the simulation node corresponding to the fault injection point.

4. The method of claim 1, wherein after the simulation node simulates the fault according to the fault logic, the method further comprises:

and destroying the fault instance corresponding to the fault logic by the simulation node.

5. A fault simulation device is characterized by comprising a processor and a transceiver,

the processor is used for acquiring service statistical data from services provided by a distributed system in a running state, wherein the service statistical data comprises flow statistical data, calling statistical information and system error information;

the processor is further configured to analyze the service statistical data and generate a fault logic according to an analysis result, where the fault logic is configured to simulate a fault;

the transceiver is used for distributing the fault logic to a simulation node, and the simulation node simulates the fault according to the fault logic;

before the simulation node simulates the fault according to the fault logic, the transceiver is also used for sending a fault request judgment rule to the simulation node, wherein the fault request judgment rule is used for the simulation node to judge whether the intercepted request is a fault request;

6. The fault simulation device according to claim 5, wherein the processor is configured to, when analyzing the service statistical data and generating the fault logic according to the analysis result, specifically, analyze the service statistical data by using a preset data analysis model and generate the fault logic according to the analysis result.

7. The fault simulation apparatus according to claim 5, wherein the transceiver is configured to distribute the fault logic to all simulation nodes with fault simulation capability when distributing the fault logic to the simulation nodes; or distributing the fault logic to the simulation node corresponding to the fault injection point.

8. The fault simulation apparatus of claim 5, wherein after the simulation node simulates the fault according to the fault logic, the simulation node destroys the fault instance corresponding to the fault logic.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-4.