CN114609995A - Fault control method, device, system, equipment, medium and product - Google Patents

Fault control method, device, system, equipment, medium and product

Publication number
CN114609995A
CN114609995A CN202210209631.9A CN202210209631A CN114609995A CN 114609995 A CN114609995 A CN 114609995A CN 202210209631 A CN202210209631 A CN 202210209631A CN 114609995 A CN114609995 A CN 114609995A
Authority
CN
China
Prior art keywords
fault
drilling
configuration data
target
influence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210209631.9A
Other languages
Chinese (zh)
Inventor
肖国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asiainfo Technology Nanjing Co ltd
Original Assignee
Asiainfo Technology Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asiainfo Technology Nanjing Co ltd filed Critical Asiainfo Technology Nanjing Co ltd
Priority to CN202210209631.9A priority Critical patent/CN114609995A/en
Publication of CN114609995A publication Critical patent/CN114609995A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B23/00 Testing or monitoring of control systems or parts thereof
    • G05B23/02 Electric testing or monitoring
    • G05B23/0205 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults
    • G05B23/0218 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterised by the fault detection method dealing with either existing or incipient faults
    • G05B23/0256 Electric testing or monitoring by means of a monitoring system capable of detecting and responding to faults characterised by the fault detection method dealing with either existing or incipient faults injecting test signals and analyzing monitored process response, e.g. injecting the test signal while interrupting the normal operation of the monitored system; superimposing the test signal onto a control signal during normal operation of the monitored system

Abstract

Embodiments of the present application provide a fault control method, apparatus, system, device, medium, and product, and relate to the field of computers. The method comprises: acquiring fault configuration data of a target fault; acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data; determining a corresponding influence factor based on the system resource configuration data and the fault configuration data; and determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information indicates whether to initiate the fault drilling corresponding to the target fault. The embodiments ensure that chaos engineering experiments are carried out under stable and controllable conditions and improve the safety of those experiments.

Description

Fault control method, device, system, equipment, medium and product
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a system, a device, a medium, and a product for controlling a fault.
Background
With the evolution of cloud-native systems, a microservice framework can be used to finely partition and govern services and to recombine and orchestrate their flows, so that services can go online quickly. At the same time, however, such systems involve numerous technology stacks and complex dependencies, which affects system stability.
To address this, chaos engineering injects faults into a system to uncover its problems in advance, thereby safeguarding system stability and improving fault tolerance. However, because the downstream effects of a fault injected into a production environment are unknown, the influence range of a chaos engineering experiment cannot be controlled, and the experiment carries a certain safety risk.
Disclosure of Invention
The embodiments of the present application provide a fault control method, apparatus, system, device, medium, and product, so that chaos engineering experiments are carried out under stable and controllable conditions and their safety is improved.
According to an aspect of an embodiment of the present application, there is provided a fault control method including:
acquiring fault configuration data of a target fault;
acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
determining a corresponding influence factor based on the system resource configuration data and the fault configuration data;
and determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not.
In one possible implementation manner, the fault influence information is used to indicate whether to initiate a fault drill corresponding to the target fault, and includes:
and when the fault risk value represented by the fault influence information exceeds a preset influence threshold value, sending an approval instruction to a user terminal, generating a corresponding drilling script based on the fault configuration data after receiving a confirmation instruction returned by the user terminal based on the approval instruction, and executing the drilling script to perform fault drilling aiming at the target fault.
In a possible implementation manner, the fault influence information is used to indicate whether to initiate a fault drill corresponding to the target fault, and further includes:
and when the fault risk value represented by the fault influence information does not exceed a preset influence threshold value, generating a corresponding drilling script based on the fault configuration data, and executing the drilling script to perform fault drilling aiming at the target fault.
In one possible implementation, the fault configuration data includes a fault type, a fault injection range, a fault service, and a fault policy; the acquiring of the fault configuration data of the target fault includes:
determining the fault type and fault service of the target fault;
sending a data selection instruction containing the fault type and the fault service of the target fault to the user terminal, so that the user terminal returns the fault injection range of the target fault in response to the data selection instruction;
orchestrating the fault service based on the fault injection range;
and configuring a fault policy related to the orchestration result.
In one possible implementation, the method further includes:
acquiring fault drilling snapshot data of the target fault in a fault drilling process;
and updating an influence factor corresponding to the fault configuration data of the target fault based on the acquired fault drilling snapshot data.
According to another aspect of an embodiment of the present application, there is provided a fault control apparatus including:
the fault configuration module is used for acquiring fault configuration data of the target fault;
the data acquisition module is used for acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
the influence factor calculation module is used for determining corresponding influence factors based on the system resource configuration data and the fault configuration data;
and the fault influence range determining module is used for determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factors, wherein the fault influence information is used for indicating whether to initiate fault drilling corresponding to the target fault.
According to another aspect of an embodiment of the present application, there is provided a fault control system including a server including:
a chaotic controller configured to acquire fault configuration data of a target fault;
the resource collector is connected with the chaotic controller and is configured to receive fault configuration data sent by the chaotic controller and obtain system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
the minimum explosion radius controller is connected with the resource collector and the chaotic controller and is configured to receive the fault configuration data, the system resource configuration data and the fault drilling snapshot data sent by the resource collector; determining a corresponding impact factor based on the system resource configuration data and the fault configuration data; determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not and sending a corresponding control instruction to the chaotic controller;
and the fault drilling actuator is connected with the chaotic controller and is configured to respond to the control instruction forwarded by the chaotic controller to execute the fault drilling of the target fault.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the fault control method according to the above embodiments.
According to another aspect of the embodiments of the present application, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the fault control method of the above-described embodiments.
According to a further aspect of an embodiment of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the fault control method of the above-described embodiment.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
By acquiring fault configuration data of a target fault, acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data, determining a corresponding influence factor based on the system resource configuration data and the fault configuration data, and determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, where the fault influence information indicates whether to initiate the fault drilling corresponding to the target fault, the fault influence range of chaos engineering is controlled. Chaos engineering experiments are thus carried out under stable and controllable conditions, which improves both the safety of the experiments and the stability of the system.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a block diagram of a computer system according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a server according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a fault control method according to an embodiment of the present application;
FIG. 4 is an interaction diagram of a fault control system provided in accordance with another exemplary embodiment of the present application;
fig. 5 is a schematic structural diagram of a fault control device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a computer system architecture provided in an embodiment of the present application, where the computer system 1 includes a user terminal 20 and a server 10. The user terminal 20 and the server 10 are connected through a communication network, and the user terminal 20 and the server 10 may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
The user terminal 20 may be any terminal device installed with an application program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent wearable device, and the like, which is not limited in this embodiment. Regarding the hardware structure, the terminal 20 includes a display, a memory, a processor and an input device, but is not limited thereto. Illustratively, the application is a terminal-side application of the multimedia platform.
The server 10 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, content distribution network, and a big data and artificial intelligence platform.
A multimedia platform is loaded on the server 10, and the server 10 provides background services for the application programs developed and running in the user terminal 20. In the present application, a user inputs the relevant parameters defining a fault, such as the target fault, the fault type, and the fault service, through the multimedia platform loaded on the user terminal 20, and the user terminal 20 obtains and stores these parameters. After the server 10 determines the fault influence information of the target fault, if the fault risk value represented by the fault influence information exceeds a preset influence threshold, the server 10 sends an approval instruction to the user terminal 20. In response, the user terminal 20 initiates an approval process, such as an expert approval process. Once approval is granted, a confirmation instruction is returned to the server 10, and the server 10 initiates the fault drilling based on that confirmation instruction.
Fig. 2 is a schematic architecture diagram of a server according to an embodiment of the present application, in which the server 10 includes:
a chaotic controller 101 configured to acquire fault configuration data of a target fault;
the resource collector 102 is connected to the chaotic controller 101, and configured to receive fault configuration data sent by the chaotic controller 101 and obtain system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
a minimum explosion radius controller 103, connected to the resource collector 102 and the chaotic controller 101, and configured to receive the fault configuration data, the system resource configuration data, and the fault drilling snapshot data sent by the resource collector 102; determining a corresponding impact factor based on the system resource configuration data and the fault configuration data; determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not and sending a corresponding control instruction to the chaotic controller 101;
and a fault drilling actuator 104 connected to the chaotic controller 101 and configured to execute fault drilling of the target fault in response to the control command forwarded by the chaotic controller 101.
In some embodiments, the chaotic controller 101 and the minimum explosion radius controller 103 are each one or more processing units, which may be central processing units or network processors, and both are connected to the user terminal 20. The chaotic controller 101 receives from the user terminal 20 the relevant parameters that the user has defined for the fault. Therefore, in some embodiments, the chaotic controller 101 is further configured to acquire the fault configuration data of the target fault by: determining the fault type and fault service of the target fault; sending a data selection instruction containing the fault type and fault service of the target fault to the user terminal, so that the user terminal returns the fault injection range of the target fault in response to the data selection instruction; orchestrating the fault service based on the fault injection range; and configuring a fault policy related to the orchestration result.
In some embodiments, the minimum explosion radius controller 103 is connected to the user terminal 20 to receive the confirmation instruction that the user terminal 20 sends in response to an approval instruction. Accordingly, the minimum explosion radius controller 103 is further configured to: when the fault risk value represented by the fault influence information exceeds a preset influence threshold, send an approval instruction to the user terminal; after receiving the confirmation instruction returned by the user terminal based on the approval instruction, generate a corresponding drilling script based on the fault configuration data; and execute the drilling script to perform fault drilling for the target fault.
In some embodiments, the minimum explosion radius controller 103 is further configured to: when the fault risk value represented by the fault influence information does not exceed the preset influence threshold, generate a corresponding drilling script based on the fault configuration data and execute the drilling script to perform fault drilling for the target fault.
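The threshold gate described in the two preceding paragraphs can be sketched as follows. This is a minimal illustration, not the implementation claimed in the application: the names `DrillOutcome` and `decide_and_run`, the callback parameters, and the behavior when approval is refused (the drill is simply not started) are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillOutcome:
    approval_requested: bool  # True if the risk value exceeded the threshold
    drill_started: bool       # True if the drilling script was executed


def decide_and_run(
    risk_value: float,
    threshold: float,
    request_approval: Callable[[], bool],
    run_drill: Callable[[], None],
) -> DrillOutcome:
    """Gate a fault drill on the preset influence threshold."""
    if risk_value > threshold:
        # High risk: ask the user terminal for confirmation first.
        if not request_approval():
            # Assumption: a refused approval means the drill is skipped.
            return DrillOutcome(approval_requested=True, drill_started=False)
        run_drill()
        return DrillOutcome(approval_requested=True, drill_started=True)
    # Risk does not exceed the threshold: run the drill directly.
    run_drill()
    return DrillOutcome(approval_requested=False, drill_started=True)
```

Passing the approval request and the drill execution as callbacks keeps the decision logic separate from how the user terminal and the fault drilling actuator are actually reached.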
In some embodiments, the minimum explosion radius controller 103 is further configured to: acquire fault drilling snapshot data of the target fault during the fault drilling process; and update the influence factor corresponding to the fault configuration data of the target fault based on the acquired fault drilling snapshot data.
By configuring the minimum explosion radius controller, the fault control system provided by the embodiments of the present application determines the influence factor of a chaos engineering target fault based on the system resource configuration data from the resource collector and the fault configuration data from the chaotic controller, and decides, based on the influence factor and the fault drilling snapshot data, whether to initiate the corresponding fault drilling. The fault influence range of chaos engineering is thereby controlled, chaos engineering experiments are carried out under stable and controllable conditions, and both the safety of the experiments and the stability of the system are improved.
Fig. 3 is a schematic flowchart of a fault control method according to an embodiment of the present application. The fault control method is used for the server 10 shown in fig. 1, and comprises steps S201 to S204.
S201, acquiring fault configuration data of the target fault.
In some embodiments, the fault configuration data includes a fault type, a fault injection range, a fault service, and a fault policy; acquiring the fault configuration data of the target fault includes:
determining the fault type and fault service of the target fault;
sending a data selection instruction containing the fault type and the fault service of the target fault to the user terminal, so that the user terminal returns the fault injection range of the target fault in response to the data selection instruction;
orchestrating the fault service based on the fault injection range;
and configuring a fault policy related to the orchestration result.
It should be noted that the fault service includes, but is not limited to, a microservice, and the fault type includes, but is not limited to, a hardware fault, a network fault, and a software fault; the present application does not limit these. Optionally, the user selects the fault type and fault service of the target fault through the user terminal. In response to the fault parameters selected by the user, the server generates and displays a plurality of related fault injection targets, from which the fault injection range corresponding to the selected fault type and fault service is chosen; the user terminal then feeds the fault injection range back to the server. The fault injection range is the range within which the chaos engineering experiment is carried out and includes at least one of an experiment target and an experiment range for triggering chaos engineering. The experiment target includes, but is not limited to, a container or an application framework, and the experiment range includes, but is not limited to, the machine, device, or cluster on which the experiment is triggered. Optionally, the server may instead determine the fault injection range adaptively from the fault type of the target fault using a mapping relationship such as a data mapping table. The present application does not limit the method for determining the fault configuration data of the target fault.
A multimedia platform is loaded on the server 10, and the server 10 provides background services for the application programs developed and running in the user terminal 20. In this embodiment, a user inputs the relevant parameters defining a fault, such as the target fault, the fault type, the fault service, and the fault injection range, through the multimedia platform loaded on the user terminal 20; the user terminal 20 obtains and stores these parameters and sends them to the server 10. The server then performs service orchestration based on the fault type, the fault service, and the fault injection range of the target fault. Illustratively, it coordinates data interaction between microservices according to a preset fault execution flow, for example by controlling the operations and the service call sequence of each microservice, so as to produce a fault event matching the fault type and the fault injection range. After the fault service orchestration is completed, corresponding fault policies, such as fault emergency plans and fault repair strategies, are configured so that the fault event can be controlled in time.
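The four sub-steps of S201 can be sketched as a small data structure plus a builder function. This is only an illustration under stated assumptions: the class `FaultConfig`, the function `acquire_fault_config`, the `select_range` callback standing in for the user terminal, and the representation of the orchestration result as one call step per injection target are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultConfig:
    fault_type: str              # e.g. "network", "hardware", "software"
    fault_service: str           # e.g. the name of a microservice
    injection_range: List[str]   # targets returned by the user terminal
    call_sequence: List[str] = field(default_factory=list)  # orchestration result
    policies: List[str] = field(default_factory=list)       # e.g. emergency plans


def acquire_fault_config(
    fault_type: str,
    fault_service: str,
    select_range: Callable[[str, str], List[str]],
    policies: List[str],
) -> FaultConfig:
    """Build fault configuration data following the four sub-steps of S201."""
    # Steps 1-2: type and service are determined; the terminal returns the range.
    injection_range = select_range(fault_type, fault_service)
    # Step 3: orchestrate the fault service over the range.
    # Assumption: the orchestration is modelled as one call step per target.
    call_sequence = [f"{fault_service}@{target}" for target in injection_range]
    # Step 4: attach the fault policies to the orchestration result.
    return FaultConfig(
        fault_type, fault_service, injection_range, call_sequence, list(policies)
    )
```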
It should be noted that the fault configuration data further includes parameters of the chaos engineering experiment model: an experiment target, an experiment range, an experiment rule matcher, and an experiment behavior. The experiment target is the component in which the chaos experiment takes place, such as a container or an application framework, and the experiment range is the machine or cluster on which the chaos experiment is triggered. The experiment rule matcher, that is, the experiment matching condition, is defined within the configured experiment target. Because each experiment target has its own matching conditions, one or more experiment rule matchers can be configured. For example, Dubbo in the RPC field can match on the service provided by a service provider and on the service called by a service consumer, and Redis in the cache field can match on set and get operations. The experiment behavior, that is, the specific scenario the experiment simulates, is determined by the experiment target. Illustratively, when the experiment target is a disk, scenarios such as a full disk or disk read/write IO can be drilled; when it is an application program, scenarios such as delay, exception, returning a specified value (an error code, a large object, and the like), parameter tampering, and repeated calls can be abstracted. Once the parameters of the chaos engineering experiment model are determined, the target fault is injected into the system and the system's response is tested. The chaos engineering experiment thus identifies and repairs faults by deliberately creating them and testing how the system behaves under various pressures.
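The four parameters of the experiment model can be pictured as a record whose matchers decide whether a given call is subject to the experiment. The sketch below is an assumption-laden illustration: the class `ExperimentModel`, the function `applies_to`, and the matcher keys (`role`, `service`) are hypothetical names, and real Dubbo or Redis integration is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ExperimentModel:
    target: str   # component where the experiment occurs, e.g. "dubbo"
    scope: str    # machine or cluster on which the experiment is triggered
    action: str   # simulated scenario, e.g. "delay" or "exception"
    matchers: Dict[str, str] = field(default_factory=dict)  # matching conditions


def applies_to(model: ExperimentModel, call: Dict[str, str]) -> bool:
    """Return True when every configured matcher equals the call's attribute."""
    return all(call.get(key) == value for key, value in model.matchers.items())


# Hypothetical example: delay calls made by a consumer of one service.
dubbo_delay = ExperimentModel(
    target="dubbo",
    scope="cluster-1",
    action="delay",
    matchers={"role": "consumer", "service": "com.example.OrderService"},
)
```

A model with no matchers applies to every call, which is why narrowing the matchers is one way to keep the influence range of an experiment small.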
On this basis, the present application combines the parameters of the chaos engineering experiment model with the system resource configuration data to determine the corresponding influence factor, and estimates the fault influence of the target fault from the influence factor and the fault drilling snapshot data. Fault drilling is then executed together with operations such as fault approval, alarm notification, and emergency plan formulation, so that the fault influence range of chaos engineering is controlled, chaos engineering experiments are carried out under stable and controllable conditions, and their safety is improved.
S202, obtaining system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data.
The system resource configuration data includes, but is not limited to, application service information and K8S resource configuration data. Illustratively, the minimum explosion radius controller 103 is provided with a K8S resource collection module, an influence factor calculation module, a fault snapshot library, a decision module, and a related management module. The K8S resource collection module is connected to the microservice registry in the server 10 to obtain application service information, such as microservice configuration information. In addition, the K8S resource collection module is connected to the Kubernetes API Server in the server 10 to obtain the K8S resource configuration data. Specifically, the Kubernetes API Server provides an interface for data interaction and communication between modules, and the K8S resource collection module connects to the K8S resource collector through the Kubernetes API Server to acquire K8S core resource data, such as Node, K8S management plane components, Service, Deployment, StatefulSet, Pod, and PV resource data.
The K8S resource collection module is connected to the fault drilling actuator through the Kubernetes API Server to obtain the fault drilling snapshot data of each fault drilling. The fault drilling snapshot data is organized by information such as fault scenario, fault injection range, and exception information. Illustratively, for a fault that causes service delay and a large number of traffic exceptions, the recorded fault drilling snapshot data includes the fault scenario information, fault injection range information, fault influence range information, and exception information corresponding to that fault. The K8S resource collection module is also connected to the fault snapshot library, which stores the fault drilling snapshot data.
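A fault drilling snapshot and the fault snapshot library can be sketched as a record type with an in-memory store keyed by fault scenario. The names `DrillSnapshot` and `SnapshotLibrary`, the field names, and the in-memory storage are assumptions made for illustration; the application does not describe how the library is persisted or queried.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DrillSnapshot:
    fault_scenario: str          # e.g. "service-delay"
    injection_range: List[str]   # where the fault was injected
    influence_range: List[str]   # what was actually affected
    exceptions: List[str] = field(default_factory=list)  # observed exception info


class SnapshotLibrary:
    """In-memory stand-in for the fault snapshot library (assumption)."""

    def __init__(self) -> None:
        self._by_scenario: Dict[str, List[DrillSnapshot]] = {}

    def save(self, snapshot: DrillSnapshot) -> None:
        # Group snapshots by fault scenario so past drills can be looked up.
        self._by_scenario.setdefault(snapshot.fault_scenario, []).append(snapshot)

    def find(self, fault_scenario: str) -> List[DrillSnapshot]:
        # Return a copy so callers cannot mutate the stored list.
        return list(self._by_scenario.get(fault_scenario, []))
```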
S203, determining corresponding influence factors based on the system resource configuration data and the fault configuration data.
In the present application, the influence factor is controlled by data such as system physical resources, K8S resources, and application services, and the influence factor is associated with the fault influence range and can reflect the fault influence range. The system resource configuration data and the fault configuration data form a one-to-one mapping relation with each influence factor, so that the server adapts the corresponding influence factors according to the acquired system resource configuration data and fault configuration data. For example, taking a fault injection range as a machine room network and a hydroelectric device as an example, since the fault injection range has a large influence surface, if a drilling is required, a disaster tolerance is required, and a supporting resource is required to perform scheme demonstration and configuration plan, the corresponding influence factor under the above condition can be set to 90. Taking a fault injection range as a Node (physical & virtual machine) as an example, and aiming at a management surface component Node with a specific experimental object of K8S, if a fault emergency plan and/or an emergency processing scheme needs to be configured when a fault is injected in the management Node, setting a corresponding influence factor to 80 under the condition; and aiming at other application nodes of a specific experimental object, setting corresponding influence factors as 100 × N/M, wherein N is the number of the nodes with fault injection, and M is the total number of the nodes. 
When the fault injection range is a Pod managed by a Deployment resource, the risk is smaller because the Pod is stateless, so the corresponding influence factor is set to 100 × N/M, where N is the number of nodes into which the fault is injected and M is the total number of nodes. For a Pod managed by a StatefulSet resource, a fault emergency plan needs to be configured because the Pod is stateful, so the corresponding influence factor is set to 70. When the fault injection range is a business service and the fault type is the delay of a core service and its service call chain, a fault emergency plan needs to be configured and the fault drilling needs to be performed during a relatively idle period, so the corresponding influence factor is set to 50. Optionally, the influence factor is configurable and can be dynamically adjusted according to the actual scenario, service, fault type, and the like.
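The example values above can be collected into a small lookup, sketched here under stated assumptions: the string labels for injection ranges and kinds, the `FaultTarget` type, and the function names are hypothetical, and only the numeric values (90, 80, 70, 50, and 100 × N/M) come from the description.

```python
from dataclasses import dataclass


@dataclass
class FaultTarget:
    injection_range: str   # "machine_room", "node", "pod" or "business_service"
    kind: str = ""         # hypothetical sub-kind, see influence_factor()
    injected: int = 0      # N: number of nodes with fault injection
    total: int = 0         # M: total number of nodes


def ratio_factor(injected: int, total: int) -> float:
    """Influence factor 100 * N / M from the description."""
    if total <= 0 or not 0 <= injected <= total:
        raise ValueError("need 0 <= N <= M and M > 0")
    return 100.0 * injected / total


def influence_factor(target: FaultTarget) -> float:
    """Map a fault target to the example influence factors of the description."""
    if target.injection_range == "machine_room":
        return 90.0                      # machine room network, water and power
    if target.injection_range == "node":
        if target.kind == "management_plane":
            return 80.0                  # K8S management-plane component node
        if target.kind == "application":
            return ratio_factor(target.injected, target.total)
    if target.injection_range == "pod":
        if target.kind == "statefulset":
            return 70.0                  # stateful Pod, emergency plan required
        if target.kind == "deployment":
            return ratio_factor(target.injected, target.total)
    if target.injection_range == "business_service" and target.kind == "core_call_chain_delay":
        return 50.0
    raise ValueError(f"no influence factor configured for {target}")
```

For instance, injecting a fault into 2 of 10 application nodes gives `influence_factor(FaultTarget("node", "application", 2, 10))`, i.e. 100 × 2/10 = 20. Because the description states that the factors are configurable, a real system would likely load these values from configuration rather than hard-code them.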
Therefore, by configuring influence factors, the present application associates the environment resource configuration of a chaotic engineering experiment with its fault configuration. The influence factor accurately reflects how strongly the environment resource configuration and the fault configuration bear on the fault influence range, which further improves the accuracy of the calculated fault influence information.
In some embodiments, the method further comprises:
acquiring fault drilling snapshot data of the target fault in a fault drilling process;
and updating an influence factor corresponding to the fault configuration data of the target fault based on the acquired fault drilling snapshot data.
In this embodiment, the fault drilling snapshot data generated in each fault drilling process is recorded, and the corresponding influence factor is automatically updated according to that data. The influence factor is thus dynamically adjusted to the actual fault behavior, so that the influence factor configured in the next fault service orchestration better matches the actual situation, which improves the accuracy of both the influence factor and the fault influence information.
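The application does not specify the update rule. Purely as an assumption for illustration, the sketch below blends the configured influence factor with an influence value observed in the snapshot data, using a weighted average; the function names, the `weight` parameter, and the way the observed value is derived are all hypothetical.

```python
def observed_influence(affected_count: int, total_count: int) -> float:
    """Observed influence on a 0-100 scale, derived from one snapshot (assumed metric)."""
    if total_count <= 0 or not 0 <= affected_count <= total_count:
        raise ValueError("need 0 <= affected_count <= total_count and total_count > 0")
    return 100.0 * affected_count / total_count


def update_influence_factor(current: float, observed: float, weight: float = 0.3) -> float:
    """Blend the configured factor with the observed value (assumed rule, not from the source)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    blended = (1.0 - weight) * current + weight * observed
    # Keep the result on the same 0-100 scale used by the example factors.
    return max(0.0, min(100.0, blended))
```

With `weight=0.5`, a configured factor of 50 and an observed influence of 80 yield 65. A smaller weight makes the factor react more slowly to any single drill.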
S204, determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not.
Fig. 4 is an interaction schematic diagram of a fault control system according to another exemplary embodiment of the present application. For a chaotic engineering experiment, the chaotic controller 101 performs chaos service orchestration and sends a call instruction to the resource collector 102, so that the resource collector 102 collects K8S resources and configuration according to the fault type of the target fault in the fault configuration data and thereby obtains the system resource configuration data. The resource collector 102 then sends a call instruction, together with the collected information, to the minimum explosion radius controller 103, so that the minimum explosion radius controller 103 calculates the fault influence information of the target fault, i.e., the explosion radius, from the fault drilling snapshot data and the influence factor determined from the system resource configuration data. The minimum explosion radius controller 103 returns the calculated fault influence information to the chaotic controller 101, and the chaotic controller 101 invokes the fault drilling executor 104 (for example, a chaos execution engine) to execute the fault drilling of the target fault. The fault drilling executor 104 then returns the data related to the fault drilling process, such as the fault drilling snapshot data and the execution results, to the chaotic controller 101.
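The call sequence of fig. 4 can be sketched as a single function that wires the four components together. This is a simplification under stated assumptions: the components are passed in as plain callables, the function and key names are hypothetical, and the direct hand-off from the resource collector to the minimum explosion radius controller is flattened into sequential calls.

```python
from typing import Any, Callable, Dict, List


def run_fault_drill(
    fault_config: Dict[str, Any],
    collect_resources: Callable[[Dict[str, Any]], Dict[str, Any]],
    assess_influence: Callable[[Dict[str, Any], Dict[str, Any]], float],
    execute_drill: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Chaotic controller role: orchestrate the other components in order."""
    trace: List[str] = ["chaotic_controller: orchestrate"]

    # Resource collector 102: gather K8S resources and configuration.
    resources = collect_resources(fault_config)
    trace.append("resource_collector: collect")

    # Minimum explosion radius controller 103: compute fault influence information.
    influence = assess_influence(fault_config, resources)
    trace.append("min_explosion_radius_controller: assess")

    # Fault drilling executor 104: run the drill and return process data.
    result = execute_drill(fault_config)
    trace.append("fault_drilling_executor: execute")

    return {"resources": resources, "influence": influence, "result": result, "trace": trace}
```

Passing the components as callables keeps the sketch independent of any particular chaos execution engine.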
In the present application, the fault influence information of the target fault includes, but is not limited to, a fault influence range, a safety degree, a fault risk value, and a fault influence level. The fault influence information is calculated dynamically from the fault drilling snapshot data and the influence factor of the target fault, and provides an execution suggestion for the fault drilling. This ensures, as far as possible, that experiments on the production system are carried out under the assumption of stability, and makes the fault range in chaotic engineering controllable. Further, the server compares the fault influence information of the target fault with a preset influence threshold and initiates the corresponding operation flow according to the comparison result.
In an embodiment, when the fault risk value represented by the fault influence information does not exceed a preset influence threshold, a corresponding drilling script is generated based on the fault configuration data, and the drilling script is executed to perform fault drilling for the target fault.
In this embodiment, when the fault risk value of the current target fault does not exceed the preset influence threshold, the fault drilling executor is called directly and executes the drilling script generated from the fault configuration data, thereby performing the fault drilling for the target fault.
In another embodiment, when the fault risk value represented by the fault influence information exceeds a preset influence threshold, an approval instruction is sent to a user terminal, and after a confirmation instruction returned by the user terminal based on the approval instruction is received, a corresponding drilling script is generated based on the fault configuration data, and the drilling script is executed to perform fault drilling on the target fault.
In this embodiment, when the fault risk value of the current target fault exceeds the preset influence threshold, an approval process is initiated, for example the review and approval flow for a high-level fault drilling, including expert review and plan preparation. A corresponding approval instruction is generated and sent to the user terminal together with the corresponding warning information. After the confirmation instruction fed back by the user terminal is obtained, the fault drilling executor is called to execute the drilling script generated from the fault configuration data, thereby performing the fault drilling for the target fault.
In another embodiment, when the fault risk value represented by the fault influence information exceeds a preset implementation threshold, alarm information and an adjustment instruction are generated and sent to the user terminal, so that the user terminal performs operations such as adjusting the fault configuration data and configuring an emergency fault strategy; alternatively, the current operation is stopped and the fault drilling of the target fault is not initiated.
Therefore, by predicting the fault influence information of the target fault, the fault influence range of the chaotic engineering is controlled, and different fault injection flows are started according to the fault influence information. This ensures that experiments on the production system are carried out under the assumption of stability and improves the safety of the experiments.
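The three embodiments above amount to a threshold gate, which might be sketched as follows. The enum and function names are hypothetical, and the sketch assumes that the preset implementation threshold is at least as large as the preset influence threshold; the description does not state how the two thresholds relate.

```python
from enum import Enum


class DrillDecision(Enum):
    EXECUTE = "execute"                    # run the drilling script directly
    REQUIRE_APPROVAL = "require_approval"  # send an approval instruction first
    ALARM_AND_ADJUST = "alarm_and_adjust"  # raise an alarm, adjust config or stop


def decide(risk_value: float, influence_threshold: float,
           implementation_threshold: float) -> DrillDecision:
    """Choose the operation flow for a target fault from its fault risk value."""
    # Assumption of this sketch: implementation threshold >= influence threshold.
    if implementation_threshold < influence_threshold:
        raise ValueError("implementation_threshold must be >= influence_threshold")
    if risk_value > implementation_threshold:
        return DrillDecision.ALARM_AND_ADJUST
    if risk_value > influence_threshold:
        return DrillDecision.REQUIRE_APPROVAL
    # "Does not exceed" includes equality with the influence threshold.
    return DrillDecision.EXECUTE
```

For example, with an influence threshold of 60 and an implementation threshold of 85 (illustrative values only), a risk value of 20 is executed directly, 70 requires approval, and 90 triggers the alarm flow.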
On the basis of the above exemplary embodiment, the influence factor calculation module is connected to the K8S resource collection module and the decision module, and is configured to determine the corresponding influence factor based on the received system resource configuration data and fault configuration data and to send the determined influence factor to the decision module. The fault snapshot repository is connected to the decision module and is used for transmitting the fault drilling snapshot data acquired in each drilling. In this way, the decision module is configured to determine the corresponding influence factor based on the system resource configuration data and the fault configuration data, and to update the influence factor based on the corresponding fault drilling snapshot data. Further, the decision module is connected to the fault drilling executor and is used for determining the fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, and for initiating the corresponding fault drilling on the fault drilling executor, so that the fault drilling executor returns the fault drilling snapshot data generated in the fault drilling process.
In some embodiments, the fault control method further comprises:
determining data to be tested based on fault configuration data of the target fault and system resource configuration data;
and utilizing a system testing tool to carry out random testing on the data to be tested.
In this embodiment, the data to be tested may be the application program corresponding to the software, module unit, or the like that is the subject of the target fault experiment, and the system testing tool includes a Monkey testing tool. Specifically, the fault drilling executor sends the data related to the fault drilling to be performed, such as the fault configuration data and the system resource configuration data, to the Monkey testing tool. The Monkey testing tool determines the data to be tested from this data and performs Monkey testing, thereby checking the stability of the system application program under fault injection.
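The idea of random (Monkey-style) testing can be sketched generically as follows. This is not the interface of any particular Monkey testing tool; the function name, the event names, and the handler are hypothetical and stand in for the application under test.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_test(handler: Callable[[str], None], events: Sequence[str],
                runs: int, seed: int) -> List[Tuple[int, str, str]]:
    """Feed randomly chosen events to the handler and record every failure."""
    rng = random.Random(seed)  # a fixed seed makes a failing run reproducible
    failures: List[Tuple[int, str, str]] = []
    for step in range(runs):
        event = rng.choice(events)
        try:
            handler(event)
        except Exception as exc:  # any exception counts as instability
            failures.append((step, event, repr(exc)))
    return failures
```

Running the same seed against the application with and without fault injection gives two failure lists whose difference points at instabilities introduced by the fault.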
The fault control method provided by the embodiments of the present application obtains fault configuration data of a target fault, obtains system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data, determines a corresponding influence factor based on the system resource configuration data and the fault configuration data, and determines fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, where the fault influence information is used for indicating whether to initiate the fault drilling corresponding to the target fault. The fault influence range of the chaotic engineering is thereby controlled, the chaotic engineering experiment is carried out under stable and controllable conditions, the safety of the chaotic engineering experiment is improved, and the stability of the system is improved.
Fig. 5 is a schematic structural diagram of a fault control device according to an embodiment of the present application, where the fault control device 300 includes:
a fault configuration module 301, configured to obtain fault configuration data of a target fault;
a data obtaining module 302, configured to obtain system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
an influence factor calculation module 303, configured to determine a corresponding influence factor based on the system resource configuration data and the fault configuration data;
a fault influence range determining module 304, configured to determine fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, where the fault influence information is used to indicate whether to initiate a fault drilling corresponding to the target fault.
In some embodiments, the fault influence range determining module 304 includes:
a first drilling unit, configured to send an approval instruction to the user terminal when the fault risk value represented by the fault influence information exceeds a preset influence threshold, generate a corresponding drilling script based on the fault configuration data after receiving a confirmation instruction returned by the user terminal based on the approval instruction, and execute the drilling script to perform fault drilling for the target fault.
In some embodiments, the fault influence range determining module 304 includes:
a second drilling unit, configured to generate a corresponding drilling script based on the fault configuration data when the fault risk value represented by the fault influence information does not exceed a preset influence threshold, and execute the drilling script to perform fault drilling for the target fault.
In some embodiments, the fault configuration data includes a fault type, a fault injection scope, a fault service, and a fault policy; the fault configuration module 301 comprises:
a fault type and fault service determining unit, configured to determine the fault type and fault service of the target fault;
a fault injection range determining unit, configured to send a data selection instruction including the fault type and the fault service of the target fault to the user terminal, so that the user terminal returns the fault injection range of the target fault after responding to the data selection instruction;
a fault service orchestration unit, configured to orchestrate the fault service based on the fault injection range;
a fault policy configuration unit, configured to configure the fault policy related to the orchestration result.
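The four units above correspond to four steps for assembling the fault configuration data, sketched below. The type name, field names, and the shape of the orchestration result are assumptions for illustration; in particular, the user terminal is represented by a plain callback, and orchestration is reduced to one step per selected target.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FaultConfiguration:
    fault_type: str
    fault_service: str
    injection_range: List[str]
    orchestration: List[str]
    fault_policy: Dict[str, str]


def build_fault_configuration(
    fault_type: str,
    fault_service: str,
    select_range: Callable[[str, str], List[str]],
    fault_policy: Dict[str, str],
) -> FaultConfiguration:
    # Step 1: fault type and fault service are already determined by the caller.
    # Step 2: ask the user terminal (here a callback) for the fault injection range.
    injection_range = select_range(fault_type, fault_service)
    if not injection_range:
        raise ValueError("the user terminal returned an empty fault injection range")
    # Step 3: orchestrate the fault service over the range (simplified to one step per target).
    orchestration = [f"inject {fault_type} into {target} for {fault_service}"
                     for target in injection_range]
    # Step 4: attach the fault policy related to the orchestration result.
    return FaultConfiguration(fault_type, fault_service, list(injection_range),
                              orchestration, dict(fault_policy))
```

Rejecting an empty injection range early avoids orchestrating a drill with no target.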
In some embodiments, the influence factor calculation module 303 includes:
a fault drilling snapshot data acquisition unit, configured to acquire fault drilling snapshot data of the target fault in the fault drilling process;
an influence factor updating unit, configured to update the influence factor corresponding to the fault configuration data of the target fault based on the acquired fault drilling snapshot data.
The apparatus of the embodiments of the present application may execute the method provided by the embodiments of the present application, and the implementation principles are similar. The actions executed by the modules of the apparatus correspond to the steps of the method; for a detailed functional description of the modules, reference may be made to the description of the corresponding method above, which is not repeated here.
An embodiment of the present application provides an electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the fault control method. Compared with the related art, the following can be achieved: fault configuration data of a target fault is obtained; system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data are obtained; a corresponding influence factor is determined based on the system resource configuration data and the fault configuration data; and fault influence information of the target fault is determined based on the fault drilling snapshot data and the influence factor, where the fault influence information is used for indicating whether to initiate the fault drilling corresponding to the target fault. The fault influence range of the chaotic engineering is thereby controlled, the chaotic engineering experiment is carried out under stable and controllable conditions, the safety of the chaotic engineering experiment is improved, and the stability of the system is improved.
In an alternative embodiment, an electronic device is provided. As shown in fig. 6, the electronic device 4000 comprises a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as the transmission and/or reception of data. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 4001 may also be a combination that performs a computational function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store a computer program and that can be read by a computer, which is not limited herein.
The memory 4003 is used for storing the computer program that carries out the embodiments of the present application, and execution of the program is controlled by the processor 4001. The processor 4001 is used to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and corresponding contents of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of the foregoing method embodiments can be implemented.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing describes only optional implementations of some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementation means that are based on the technical idea of the present application and do not depart from it also fall within the protection scope of the embodiments of the present application.

Claims (10)

1. A fault control method, comprising:
acquiring fault configuration data of a target fault;
acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
determining a corresponding influence factor based on the system resource configuration data and the fault configuration data;
and determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not.
2. The method of claim 1, wherein the fault impact information is used to indicate whether to initiate a fault drill corresponding to the target fault, and comprises:
and when the fault risk value represented by the fault influence information exceeds a preset influence threshold value, sending an approval instruction to a user terminal, generating a corresponding drilling script based on the fault configuration data after receiving a confirmation instruction returned by the user terminal based on the approval instruction, and executing the drilling script to perform fault drilling aiming at the target fault.
3. The method of claim 1, wherein the fault impact information is used to indicate whether to initiate a fault drill corresponding to the target fault, further comprising:
and when the fault risk value represented by the fault influence information does not exceed a preset influence threshold value, generating a corresponding drilling script based on the fault configuration data, and executing the drilling script to perform fault drilling aiming at the target fault.
4. The method of claim 1, wherein the fault configuration data includes fault type, fault injection scope, fault service, and fault policy; the acquiring of the fault configuration data of the target fault includes:
determining the fault type and fault service of the target fault;
sending a data selection instruction containing the fault type of the target fault and the fault service to the user terminal, so that the user terminal returns the fault injection range of the target fault after responding to the data selection instruction;
orchestrating the fault service based on the fault injection scope;
and configuring a fault strategy related to the arrangement result.
5. The method of claim 1, further comprising:
acquiring fault drilling snapshot data of the target fault in a fault drilling process;
and updating an influence factor corresponding to the fault configuration data of the target fault based on the acquired fault drilling snapshot data.
6. A fault control device, comprising:
the fault configuration module is used for acquiring fault configuration data of the target fault;
the data acquisition module is used for acquiring system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
the influence factor calculation module is used for determining corresponding influence factors based on the system resource configuration data and the fault configuration data;
and the fault influence range determining module is used for determining fault influence information of the target fault based on the fault drilling snapshot data and the influence factors, wherein the fault influence information is used for indicating whether to initiate fault drilling corresponding to the target fault.
7. A fault control system comprising a server, the server comprising:
a chaotic controller configured to acquire fault configuration data of a target fault;
a resource collector, connected with the chaotic controller and configured to receive the fault configuration data sent by the chaotic controller and obtain system resource configuration data and fault drilling snapshot data corresponding to the fault configuration data;
a minimum explosion radius controller, connected with the resource collector and the chaotic controller and configured to: receive the fault configuration data, the system resource configuration data and the fault drilling snapshot data sent by the resource collector; determine a corresponding influence factor based on the system resource configuration data and the fault configuration data; determine fault influence information of the target fault based on the fault drilling snapshot data and the influence factor, wherein the fault influence information is used for indicating whether fault drilling corresponding to the target fault is initiated or not; and send a corresponding control instruction to the chaotic controller;
and a fault drilling executor, connected with the chaotic controller and configured to execute the fault drilling of the target fault in response to the control instruction forwarded by the chaotic controller.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the fault control method according to any of claims 1-5.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the fault control method according to any one of claims 1 to 5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the fault control method of any one of claims 1-5 when executed by a processor.
CN202210209631.9A 2022-03-04 2022-03-04 Fault control method, device, system, equipment, medium and product Pending CN114609995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210209631.9A CN114609995A (en) 2022-03-04 2022-03-04 Fault control method, device, system, equipment, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210209631.9A CN114609995A (en) 2022-03-04 2022-03-04 Fault control method, device, system, equipment, medium and product

Publications (1)

Publication Number Publication Date
CN114609995A true CN114609995A (en) 2022-06-10

Family

ID=81860869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210209631.9A Pending CN114609995A (en) 2022-03-04 2022-03-04 Fault control method, device, system, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN114609995A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9389961B1 (en) * 2014-09-30 2016-07-12 Veritas Technologies Llc Automated network isolation for providing non-disruptive disaster recovery testing of multi-tier applications spanning physical and virtual hosts
US10684940B1 (en) * 2018-09-18 2020-06-16 Amazon Technologies, Inc. Microservice failure modeling and testing
CN110308969A (en) * 2019-06-26 2019-10-08 深圳前海微众银行股份有限公司 Failure drilling method, device, equipment and computer storage medium
CN110765023A (en) * 2019-10-29 2020-02-07 中国工商银行股份有限公司 Distributed system testing method and system based on chaos experiment
CN111753751A (en) * 2020-06-28 2020-10-09 辽宁大学 Fan fault intelligent diagnosis method for improving firework algorithm
CN113010393A (en) * 2021-02-25 2021-06-22 北京四达时代软件技术股份有限公司 Fault drilling method and device based on chaotic engineering
CN113515449A (en) * 2021-05-19 2021-10-19 中国工商银行股份有限公司 Chaos test method, system, electronic equipment and storage medium
CN113935178A (en) * 2021-10-21 2022-01-14 北京同创永益科技发展有限公司 Explosion radius control system and method for cloud-originated chaos engineering experiment
CN114003428A (en) * 2021-11-04 2022-02-01 北京达佳互联信息技术有限公司 Fault early warning method and device for distributed system
CN114113984A (en) * 2021-11-29 2022-03-01 平安壹账通云科技(深圳)有限公司 Fault drilling method, device, terminal equipment and medium based on chaotic engineering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴冕冠;: "混沌工程的应用研究与探索", 中国金融电脑, no. 09, 7 September 2020 (2020-09-07) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168222A (en) * 2022-07-21 2022-10-11 北京同创永益科技发展有限公司 Method for producing lossless chaotic engineering experiment
CN115794529A (en) * 2022-12-06 2023-03-14 安超云软件有限公司 Storage fault simulation method, device, equipment and storage medium in cloud environment
CN115794529B (en) * 2022-12-06 2023-09-05 安超云软件有限公司 Storage fault simulation method, device and equipment in cloud environment and storage medium

Similar Documents

Publication Publication Date Title
CN114609995A (en) Fault control method, device, system, equipment, medium and product
CN108492005B (en) Project data processing method and device, computer equipment and storage medium
CN107807841B (en) Server simulation method, device, equipment and readable storage medium
CN114692169B (en) Page vulnerability processing method applying big data and AI analysis and page service system
CN112035344A (en) Multi-scenario test method, device, equipment and computer readable storage medium
CN111104158A (en) Software packaging method and device, computer equipment and storage medium
CN116090808A (en) RPA breakpoint reconstruction method and device, electronic equipment and medium
CN108376110A (en) A kind of automatic testing method, system and terminal device
CN114006815B (en) Automatic deployment method and device for cloud platform nodes, nodes and storage medium
CN112650689A (en) Test method, test device, electronic equipment and storage medium
CN116974874A (en) Database testing method and device, electronic equipment and readable storage medium
CN111159029A (en) Automatic testing method and device, electronic equipment and computer readable storage medium
CN116303069A (en) Test method, device, upper computer, system and medium of vehicle-mounted terminal
CN111597093A (en) Exception handling method, device and equipment
CN113869989A (en) Information processing method and device
CN113296825A (en) Application gray level publishing method and device and application publishing system
CN114691445A (en) Cluster fault processing method and device, electronic equipment and readable storage medium
CN115757088B (en) Fault injection method, device and equipment based on environment variable
CN113609145B (en) Database processing method, device, electronic equipment, storage medium and product
CN111338926A (en) Patch testing method and device and electronic equipment
CN112560035B (en) Application detection method, device, equipment and storage medium
CN113704016B (en) Cloud function component diagnosis method, device, equipment and storage medium
CN110737718A (en) Data backup method and device
CN117130945B (en) Test method and device
CN109445964B (en) Method and device for data transmission with SAP system in external system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination