CN115686913A - Application fault drilling method and system based on kubernetes cluster - Google Patents

Application fault drilling method and system based on Kubernetes cluster

Publication number
CN115686913A
Authority
CN
China
Prior art keywords
task
experiment
controller
experimental
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211365891.1A
Other languages
Chinese (zh)
Inventor
梅金东
詹赵林
王畅
郭进
王鑫
孙佳明
谢瑒
梅洪彰
赵文川
黄文杰
刘金华
聂子璇
刘清
张绍兴
王汝珅
宫婷
侯宇
王浩文
鄢鹏
葛阳
刘小强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial Bank Co Ltd
CIB Fintech Services Shanghai Co Ltd
Original Assignee
Industrial Bank Co Ltd
CIB Fintech Services Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial Bank Co Ltd and CIB Fintech Services Shanghai Co Ltd

Landscapes

  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention provides an application fault drilling method and system based on a Kubernetes cluster, relating to the technical field of cloud native, and comprising the following steps: step S1: task scheduling taskYml chaos experiment data is submitted to the APIserver; the task flow controller TaskController processes the data immediately after its watch receives the event, and the experiment data is persisted through etcd; step S2: the controller ChaosController parses the experiment object, selects the corresponding fault injection mode according to the operation type of the experiment object, and manages the life cycle of the experiment; step S3: the simulated fault injection component Agent executes predetermined logic according to the fault injection type. The method can actively inject abnormal software or hardware fault states into an application under a certain load, simulate fault scenarios that may be encountered in actual production operation, locate the factors that affect system stability, and improve the resilience of the distributed system.

Description

Application fault drilling method and system based on kubernetes cluster
Technical Field
The invention relates to the technical field of cloud native, and in particular to an application fault drilling method and system based on a Kubernetes cluster.
Background
In the cloud-native field, distributed application architectures deployed on Kubernetes are increasingly complex and unpredictable. Existing software testing strategies focus on preventing predictable system risks and cannot meet the system stability requirements of a Kubernetes environment, so actively probing for application faults has become an effective solution.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an application fault drilling method and system based on a Kubernetes cluster.
The scheme of the application fault drilling method and system based on a Kubernetes cluster is as follows:
in a first aspect, a Kubernetes cluster-based application fault drilling method is provided, which involves:
task flow controller TaskController: controlling the life cycle of the drilling task;
controller ChaosController: controlling the life cycle of the experiment, wherein the task object becomes a unit controlled by the ChaosController after being decomposed by the task flow controller TaskController;
component Agent: injecting the relevant abnormal disturbance into a specified application;
explosion radius control component ExplosionController: controlling the explosion radius of the experiment, the current object being selected using the native label-based selection capability of Kubernetes;
the implementation process of the method comprises the following steps:
step S1: task scheduling taskYml chaos experiment data is submitted to the APIserver; the task flow controller TaskController processes the data immediately after its watch receives the event, and the experiment data is persisted through etcd;
step S2: the controller ChaosController parses the experiment object, selects the corresponding fault injection mode according to the operation type of the experiment object, and manages the life cycle of the experiment;
step S3: the simulated fault injection component Agent executes predetermined logic according to the fault injection type.
Preferably, the step S1 includes:
step S1.1: the task flow controller TaskController parses a task object from the submitted task scheduling taskYml, persists the data of the task object, and determines the type of the task flow from the taskType in the metadata of the task object; step S1.2 is executed for a single-experiment task flow and step S1.3 for a multi-experiment task flow;
step S1.2: a single-experiment chaosObject object is packaged according to the chaosTemplate of the task object and submitted to the APIserver; the controller ChaosController then watches the Chaos object, executes the specific experiment, and feeds back the experiment state through the Reconcile mechanism;
step S1.3: the multi-experiment task is executed, and whether the task object is executed serially or in parallel is determined according to the task type;
step S1.4: the timed task type is defined through the Schedule in the chaosTemplate; a task executed once is run in Job mode, and a task executed multiple times is run in CronJob mode.
Preferably, the step S1.3 specifically includes:
if the task is executed serially, the single-experiment chaosObject objects parsed from the task object are put into the queue TaskQueue and then taken from the queue in order for execution; the execution of each Chaos object is still controlled by the controller ChaosController, which feeds back the result; after an experiment finishes, the task flow controller TaskController is notified by callback, caches the number of executed experiments, and compares that number with the total number in the task object to determine the execution state of the task;
if the task is executed in parallel, it is executed in the manner of step S1.2.
Preferably, the step S2 includes:
after capturing a chaosObject object, the controller ChaosController parses the main data of the object and determines the injection mode according to the value of chaosType;
for fault injection of the non-lifecycle-management class, an AgentObject object is assembled from the taskTarget and action contents, and the node information is then confirmed from the information on the object the experiment acts on, so that the experiment data is sent to the specific Agent program; the Agent execution result is monitored through Reconcile, and a recovery instruction is sent to the Agent after the experiment ends;
for experiment actions of the Pod lifecycle management class, fault injection is performed by calling the Pod-related APIs of the APIserver; after injection completes, the completion state of the experiment is verified by monitoring Pod time, and the ChaosController then updates the state of the task, which completes one task.
Preferably, the step S3 includes:
step S3.1: the Agent parses the AgentObject to obtain the experiment data, analyzes it to determine whether the experiment targets the Pod or the node where the Pod is located, and then carries out the specific experiment;
step S3.2: the Agent executes the logic corresponding to the submitted experiment type;
step S3.3: when the duration of the fault injection phase ends, the chaos program injected by the Agent is destroyed, the namespaces of the fault-injected application Pod are exited, and the experiment result is sent to the controller ChaosController.
Preferably, the step S3.1 further comprises: if the experiment targets a Pod, the Agent enters the namespaces of the Pod through its attachNS() function to operate on the experiment object; if it targets a node, the experiment injection is completed directly on the node.
Preferably, the step S3.2 further comprises: the injectFault() function is called to inject the experiment according to the experiment object and the experiment type, and the result is then returned by callback; the injection mode used by injectFault() is determined by the type.
In a second aspect, a Kubernetes cluster-based application fault drilling system is provided, the system comprising:
task flow controller TaskController: controlling the life cycle of the drilling task;
controller ChaosController: controlling the life cycle of the experiment, wherein the task object becomes a unit controlled by the ChaosController after being decomposed by the task flow controller TaskController;
component Agent: injecting the relevant abnormal disturbance into a specified application;
explosion radius control component ExplosionController: controlling the explosion radius of the experiment, the current object being selected using the native label-based selection capability of Kubernetes;
the system comprises:
a module M1: task scheduling taskYml chaos experiment data is submitted to the APIserver; the task flow controller TaskController processes the data immediately after its watch receives the event, and the experiment data is persisted through etcd;
a module M2: the controller ChaosController parses the experiment object, selects the corresponding fault injection mode according to the operation type of the experiment object, and manages the life cycle of the experiment;
a module M3: the simulated fault injection component Agent executes predetermined logic according to the fault injection type.
Preferably, said module M1 comprises:
module M1.1: the task flow controller TaskController parses a task object from the submitted task scheduling taskYml, persists the data of the task object, and determines the type of the task flow from the taskType in the metadata of the task object; module M1.2 is executed for a single-experiment task flow and module M1.3 for a multi-experiment task flow;
module M1.2: a single-experiment chaosObject object is packaged according to the chaosTemplate of the task object and submitted to the APIserver; the controller ChaosController then watches the Chaos object, executes the specific experiment, and feeds back the experiment state through the Reconcile mechanism;
module M1.3: the multi-experiment task is executed, and whether the task object is executed serially or in parallel is determined according to the task type;
module M1.4: the experiment task objects in module M1.2 and module M1.3 are of a timed task type; a task executed once is run in Job mode, and a task executed multiple times is run in CronJob mode;
the module M1.3 specifically comprises:
if the task is executed serially, the single-experiment chaosObject objects parsed from the task object are put into the queue TaskQueue and then taken from the queue in order for execution; the execution of each Chaos object is still controlled by the controller ChaosController, which feeds back the result; after an experiment finishes, the task flow controller TaskController is notified by callback, caches the number of executed experiments, and compares that number with the total number in the task object to determine the execution state of the task;
if the task is executed in parallel, it is executed using module M1.2.
Preferably, said module M2 comprises:
after capturing a chaosObject object, the controller ChaosController parses the main data of the object and determines the injection mode according to the value of chaosType;
for fault injection of the non-lifecycle-management class, an AgentObject object is assembled from the taskTarget and action contents, and the node information is then confirmed from the information on the object the experiment acts on, so that the experiment data is sent to the specific Agent program; the Agent execution result is monitored through Reconcile, and a recovery instruction is sent to the Agent after the experiment ends;
for experiment actions of the Pod lifecycle management class, fault injection is performed by calling the Pod-related APIs of the APIserver; after injection completes, the completion state of the experiment is verified by monitoring Pod time, and the ChaosController then updates the state of the task, which completes one task;
the module M3 comprises:
module M3.1: the Agent parses the AgentObject to obtain the experiment data, analyzes it to determine whether the experiment targets the Pod or the node where the Pod is located, and then carries out the specific experiment;
module M3.2: the Agent executes the logic corresponding to the submitted experiment type;
module M3.3: when the duration of the fault injection phase ends, the chaos program injected by the Agent is destroyed, the namespaces of the fault-injected application Pod are exited, and the experiment result is sent to the controller ChaosController;
the module M3.1 further comprises: if the experiment targets a Pod, the Agent enters the namespaces of the Pod through its attachNS() function to operate on the experiment object; if it targets a node, the experiment injection is completed directly on the node;
the module M3.2 further comprises: the injectFault() function is called to inject the experiment according to the experiment object and the experiment type, and the result is then returned by callback; the injection mode used by injectFault() is determined by the type.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention is a fault drilling system designed mainly for Kubernetes clusters. It reuses the Kubernetes infrastructure as far as possible to implement application fault drilling, for example: some fault types are defined using the Kubernetes operator extension mechanism and corresponding controllers are implemented; fault injection experiments are managed by category, a separate CRD (CustomResourceDefinition) is designed for each type of fault, and the ChaosController controller manages the full life cycle of fault injection. Cluster information is managed through Kubeconfig configuration files, which provides fault injection capability for multi-cluster applications. In this way, the difficulty of deployment and use is reduced;
2. The invention provides an active fault injection scheme based on Kubernetes. Different implementation components are designed for different types of fault injection according to the characteristics of actual fault injection and the applicability of each method, and data transfer between the components, the controllers and so on relies on Kubernetes' own capabilities;
3. The invention implements a multi-dimensional explosion radius controller that controls the target experiment object through user-defined labels and network policies, supports fault injection in different directions by design with respect to the scope of impact after injection, and limits the scope of impact of the fault;
4. The invention provides task workflow and experiment scheduling capabilities to achieve orderly injection simulation of multiple faults without resource preemption.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the overall framework;
FIG. 2 is a diagram of a data structure packet;
FIG. 3 is a schematic diagram of a TaskQueue;
FIG. 4 is a schematic diagram of Schedule;
FIG. 5 is a functional diagram of an Agent performing fault injection;
FIG. 6 is a diagram of the fault-handling logic of the ExplosionController.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention.
The embodiment of the invention provides an application fault drilling method based on a Kubernetes cluster, shown with reference to FIG. 1, in which the main components involved and their functional designs are as follows:
1) Task flow controller TaskController: the TaskController controls the life cycle of a drilling task and orchestrates and controls chaos drilling tasks through the extended API resource object TaskFlow. A task flow contains one or more experiments, which are executed in parallel or serially; the fault experiments in the chaosList can be executed in different modes, and the configured feedback mode determines whether a subsequent experiment depends on the preceding one in order to continue.
The TaskController comprises three core designs: the TaskFlow data structure, the TaskQueue task queue and Schedule scheduling. The latter two are not used by single-experiment tasks or by tasks that involve no timed scheduling.
Referring to FIG. 2, the TaskFlow defines the task flow structure and is designed as a standard Kubernetes API object. Its data structure comprises metadata and chaosTemplate experiment object data, mainly including: task identifier, experiment list, scheduling type, action object, experiment action and the like.
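As a rough illustration of the structure just described, the sketch below models a task flow as plain Python data classes. The field names (task_id, task_type, chaos_list, chaos_target, schedule) and the example values are assumptions chosen to mirror the items listed in the text; the source does not give the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChaosTemplate:
    """One experiment entry; all field names here are hypothetical."""
    chaos_type: str                 # which injection mode to use
    chaos_target: str               # action object, e.g. a Pod or node name
    action: str                     # experiment action, e.g. "delete"
    schedule: Optional[str] = None  # cron expression, or None for a one-off run


@dataclass
class TaskFlow:
    """Task flow: metadata plus the list of experiments it contains."""
    task_id: str
    task_type: str                  # assumed values: "single", "serial", "parallel"
    chaos_list: List[ChaosTemplate] = field(default_factory=list)

    def is_single(self) -> bool:
        # A single-experiment task carries exactly one experiment.
        return self.task_type == "single" and len(self.chaos_list) == 1


flow = TaskFlow(
    task_id="demo-task",
    task_type="single",
    chaos_list=[ChaosTemplate("pod-lifecycle", "demo-pod", "delete")],
)
```

In a real operator this shape would normally be declared as a custom resource rather than as in-process classes; the classes only make the listed fields concrete.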
Referring to FIG. 3, the TaskQueue is a first-in first-out queue of task flows for multiple experiments executed serially: a task enters the queue first and leaves it after execution completes, and a single-experiment task does not enter the queue. The experiments in the chaosList of a serially executed multi-experiment task enter the TaskQueue as a whole, and experiments are terminated per task: when the task ends, its experiments end automatically and all of them then leave the queue.
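The queue behaviour described here can be sketched with a standard double-ended queue. This is only an illustration under the assumption that a task is identified by a string and carries its experiment names; the class and method names are hypothetical.

```python
from collections import deque


class TaskQueue:
    """FIFO queue of serially executed multi-experiment tasks (illustrative)."""

    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task_id, experiments):
        # The whole experiment list of a task enters the queue as one unit.
        self._tasks.append((task_id, list(experiments)))

    def finish_head(self):
        # When the head task ends, all of its experiments leave together.
        return self._tasks.popleft()

    def __len__(self):
        return len(self._tasks)


queue = TaskQueue()
queue.enqueue("task-a", ["cpu-stress", "net-delay"])
queue.enqueue("task-b", ["pod-kill"])
first_id, first_experiments = queue.finish_head()
```

Keeping the experiments grouped under their task is what lets termination happen per task rather than per experiment.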
Referring to FIG. 4, the Schedule defined by the chaosTemplate covers two types of experiment scheduling, periodic and aperiodic, which are implemented on CronJob and Job respectively, and results are monitored using the Reconcile mechanism.
2) Controller ChaosController: a chaos experiment controller designed with the Kubernetes operator pattern. It is the only component that can control the life cycle of an experiment, and a task object becomes a unit controlled by the ChaosController after being decomposed by the TaskController. The controller completes the experiment in different ways depending on the specific experiment type, and three modes are implemented in this scheme: the first calls Pod-oriented APIs such as delete, edit and patch; the second uses a ChaosWebhook to inject a Sidecar container when the Pod is created and injects the fault through that container; the third packages the fault injection program into an Agent deployed in the cluster as a DaemonSet and, exploiting the characteristics of containers, achieves non-intrusive fault injection into the application by sharing the container namespaces. After fault injection completes, the controller learns the experiment result through the Reconcile feedback mechanism and controls the complete life cycle of the experiment.
3) Component Agent: the Agent simulated fault injection component injects the relevant abnormal disturbance, including software-level and hardware-level faults, into a specified application, and is deployed on every node of the cluster as a Kubernetes DaemonSet. The Agent encapsulates the fault handling logic, and updating that logic requires resetting the Agent's Pod. The Agent provides its service by exposing a port, which hides the underlying implementation details and allows the fault injection implementation to be updated transparently. It mainly provides four methods: entering namespaces, attachNS(nsId); injecting a fault, injectFault(targetId, type, x); obtaining the result, getResult(targetId, x); and terminating an experiment, stopFault(targetId, x).
The Agent component in this scheme is implemented mainly by encapsulating call interfaces for operating the Cgroups and Namespaces of Linux containers, the Java method-level bytecode injection tool Byteman, host stress tools, TC network traffic control, and other script tools capable of injecting faults at the Pod level, the application method level and the host level. The fault injection function performed by the Agent is shown with reference to FIG. 5.
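The four Agent operations described above (entering namespaces, injecting a fault, obtaining the result, terminating an experiment) can be pictured with the in-memory stand-in below. It is a sketch only: the Python method names, the result dictionary and the state strings are assumptions, and no real namespace or fault operation is performed.

```python
class Agent:
    """In-memory stand-in for the Agent's four operations (hypothetical)."""

    def __init__(self):
        self.current_ns = None
        self.results = {}

    def attach_ns(self, ns_id):
        # A real Agent would enter the target container's Linux namespaces here.
        self.current_ns = ns_id

    def inject_fault(self, target_id, fault_type):
        # Record the fault as running; a real Agent would start the injector.
        self.results[target_id] = {"type": fault_type, "state": "running"}

    def get_result(self, target_id):
        return self.results.get(target_id)

    def stop_fault(self, target_id):
        # Terminating an unknown experiment is treated as a no-op.
        if target_id in self.results:
            self.results[target_id]["state"] = "stopped"


agent = Agent()
agent.attach_ns("ns-123")
agent.inject_fault("pod-1", "network-delay")
agent.stop_fault("pod-1")
```

Exposing only these operations is what allows the injection tooling behind them to change without the controller noticing.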
4) Explosion radius control component ExplosionController: the ExplosionController component controls the explosion radius of an experiment. It selects the current object using the native label-based selection capability of Kubernetes and provides experiment object location at several granularities such as Pod, Label and Service. For simulating abnormal network traffic fault scenarios, the component supports control over the north-south dimension of traffic, simulating network faults at the requesting end and at the serving end, and combines labels identifying the objects on which the experiment is arranged with network policies to control the explosion radius precisely.
The north-south dimension controls the fault injection object. Taking a request from A to service B as an example: the south dimension injects the fault into B, so that B's fault affects the response to A's request; the north dimension injects the fault into A, and the fault is simulated by delaying A's request.
The ExplosionController locates target objects with an object selector at three granularities, Pod, Label and Service, from small to large: Service selects injected objects from the service perspective, Pod from the single-instance perspective, and Label from a metadata perspective. If Service is used to select the target injection object, the controller parses the backend Endpoint object of the Service, queries the valid backend Pod objects, and then uses the Pod selector to inject the fault into specific instances.
The fault-handling logic of the ExplosionController is shown with reference to FIG. 6.
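The three selector granularities can be sketched as a small resolution function. This is a minimal sketch over in-memory dictionaries, not the patented component: the selector tuple format, the "ready" flag standing in for a valid Endpoint backend, and all names are assumptions.

```python
def resolve_targets(selector, pods, services):
    """Return the names of Pods matched by a Pod, Label or Service selector."""
    kind, value = selector
    if kind == "pod":
        return [p["name"] for p in pods if p["name"] == value]
    if kind == "label":
        key, val = value
        return [p["name"] for p in pods if p["labels"].get(key) == val]
    if kind == "service":
        # Stand-in for reading the Service's Endpoints: keep only ready backends.
        ready = {p["name"] for p in pods if p["ready"]}
        return [name for name in services[value] if name in ready]
    raise ValueError(f"unknown selector kind: {kind}")


pods = [
    {"name": "web-0", "labels": {"app": "web"}, "ready": True},
    {"name": "web-1", "labels": {"app": "web"}, "ready": False},
    {"name": "db-0", "labels": {"app": "db"}, "ready": True},
]
services = {"web-svc": ["web-0", "web-1"]}

service_targets = resolve_targets(("service", "web-svc"), pods, services)
label_targets = resolve_targets(("label", ("app", "web")), pods, services)
```

The Service branch ends with Pod names, matching the description that a Service selection is finally narrowed to specific instances.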
Specifically, the implementation process of the method includes:
step S1: task scheduling taskYml chaos experiment data is submitted to the APIserver; the TaskController processes the data immediately after its watch receives the event, and etcd persists the data.
Specifically, the step S1 specifically includes:
step S1.1: the task flow controller TaskController parses a task object from the submitted taskYml, persists the object data, and determines the type of the task flow from the task type in the metadata of the task object; step S1.2 is executed for a single-experiment task flow and step S1.3 for a multi-experiment task flow.
Step S1.2: a single-experiment chaosObject object is packaged according to the chaosTemplate of the task object and submitted to the APIserver; the controller ChaosController then watches the Chaos object, executes the specific experiment, and feeds back the experiment state through the Reconcile mechanism.
Step S1.3: the multi-experiment task is executed, and whether the task object is executed serially or in parallel is determined according to the task type. If the task is executed serially, the single-experiment chaosObject objects parsed from the task are put into the queue TaskQueue and then taken from the queue in order for execution; the execution of each chaos object is still controlled by the ChaosController, which feeds back the result. After an experiment finishes, the TaskController is notified by callback, caches the number of executed experiments, and compares that number with the total number in the task to determine the execution state of the task. If the task is executed in parallel, it is executed in the manner of step S1.2.
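The serial path of step S1.3 can be sketched as a loop that counts completed experiments in a callback and compares the count with the total. The state names "Completed", "Running" and "Failed", and the rule of stopping at the first failed experiment, are assumptions made for the illustration; the `execute` argument stands in for execution by the ChaosController.

```python
from collections import deque


def run_serial(experiments, execute):
    """Run experiments one by one; return (state, completed_count)."""
    pending = deque(experiments)
    total = len(pending)
    completed = 0

    def on_done(_name):
        # Callback: the task controller caches how many experiments finished.
        nonlocal completed
        completed += 1

    while pending:
        name = pending.popleft()
        if not execute(name):          # stand-in for ChaosController execution
            return "Failed", completed
        on_done(name)
    # Comparing the cached count with the total decides the task state.
    state = "Completed" if completed == total else "Running"
    return state, completed


ok_state = run_serial(["cpu-stress", "net-delay"], lambda name: True)
bad_state = run_serial(["cpu-stress", "net-delay"], lambda n: n != "net-delay")
```

In a controller the loop would be driven by events rather than a blocking call, but the count-versus-total comparison is the same.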
Step S1.4: the Schedule in the chaosTemplate of the task object in step S1.2 and step S1.3 defines the timed task type; a task executed once is run in Job mode and a task executed multiple times is run in CronJob mode, both of which rely on native Kubernetes capabilities.
Step S2: the controller ChaosController parses the experiment object and then selects the corresponding fault injection mode according to the operation type of the object, managing the life cycle of the experiment.
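The choice between the two modes in step S1.4 can be sketched as below. The rule used to tell the cases apart (a missing or empty cron expression means a one-off run) is an assumption, and the returned dictionary is only a skeleton of a manifest, not a complete Job or CronJob specification.

```python
def scheduling_kind(schedule):
    """Pick the Kubernetes workload kind for an experiment's schedule."""
    # Assumption: a one-off run has no cron expression; a recurring run has one.
    if schedule is None or schedule.strip() == "":
        return "Job"
    return "CronJob"


def build_manifest(name, schedule=None):
    """Return a skeleton manifest (pod template omitted for brevity)."""
    kind = scheduling_kind(schedule)
    manifest = {
        "apiVersion": "batch/v1",
        "kind": kind,
        "metadata": {"name": name},
    }
    if kind == "CronJob":
        manifest["spec"] = {"schedule": schedule}
    return manifest


once = build_manifest("kill-pod-once")
nightly = build_manifest("kill-pod-nightly", "0 2 * * *")
```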
Step S2 specifically includes: after capturing the chaosObject object, the controller ChaosController parses the main data of the object, including chaosType, chaosTarget, action and targetSector; these four fields contain the information on which the fault injection type depends. The value of chaosType determines the injection mode: Pod lifecycle management, DaemonSet shared-namespaces injection, or sidecar injection. For fault injection of the non-lifecycle-management class, an AgentObject object is assembled from the taskTarget and action contents, and the node information is then confirmed from the information on the object the experiment acts on, so that the experiment data is sent to the specific Agent program; the Agent execution result is monitored through Reconcile, and a recovery instruction is sent to the Agent after the experiment ends.
For experiment actions of the Pod lifecycle management class, fault injection is performed by calling the Pod-related APIs of the APIserver; after injection completes, the completion state of the experiment is verified by monitoring Pod time, and the ChaosController then updates the state of the task, which completes one task.
Step S3: the simulated fault injection component Agent executes predetermined logic according to the fault injection type.
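The selection of an injection mode from the chaosType value in step S2 amounts to a dispatch table. The sketch below assumes three literal chaosType values ("PodLifecycle", "AgentNamespace", "Sidecar") and placeholder handlers that only return a description; neither the values nor the handler names come from the source.

```python
def inject_via_pod_api(obj):
    # Placeholder for calling Pod-related APIs (delete, patch, ...).
    return f"pod-api:{obj['action']}"


def inject_via_agent(obj):
    # Placeholder for building an AgentObject and sending it to a node's Agent.
    return f"agent:{obj['action']}"


def inject_via_sidecar(obj):
    # Placeholder for injection through a Sidecar container.
    return f"sidecar:{obj['action']}"


HANDLERS = {
    "PodLifecycle": inject_via_pod_api,
    "AgentNamespace": inject_via_agent,
    "Sidecar": inject_via_sidecar,
}


def dispatch(chaos_object):
    """Route a chaos object to the handler for its chaosType."""
    handler = HANDLERS.get(chaos_object["chaosType"])
    if handler is None:
        raise ValueError(f"unsupported chaosType: {chaos_object['chaosType']}")
    return handler(chaos_object)


result = dispatch({"chaosType": "PodLifecycle", "action": "delete"})
```

A table keeps the controller's reconcile code free of type-specific branches, so adding a mode means adding one entry.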
Specifically, step S3 includes:
step S3.1: the Agent parses the AgentObject to obtain the experiment data and analyzes it to determine whether the experiment targets the Pod or the node where the Pod is located, and then carries out the specific experiment: if the experiment targets a Pod, the Agent enters the namespaces of the Pod through its attachNS() function to operate on the experiment object; if it targets a node, the experiment injection is completed directly on the node.
Step S3.2: the Agent executes the logic corresponding to the submitted experiment type. It first calls the injectFault() function to inject the experiment according to the experiment object and the experiment type and then returns the result by callback; the injection mode of injectFault() is determined by the type. For example, to add delay to a specified method of a Java program, a bytecode-level delay script is injected with the Byteman component to achieve method-level delay; fault injection of the resource-occupation class is implemented by launching packaged programs such as the stress tool. During the experiment, stopFault() can be received to terminate an experiment, and the experiment result is obtained through getResult().
Step S3.3: when the duration of the fault injection phase ends, the chaos program injected by the Agent is destroyed, the namespaces of the fault-injected application Pod are exited, and the experiment result is sent to the controller ChaosController.
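How the injection mode might depend on the type can be sketched by mapping a fault type to the command line of a packaged tool. The fault type names and parameter keys are hypothetical, the function only builds the argument list and does not run anything, and the actual commands used by the Agent are not given in the source.

```python
def build_fault_command(fault_type, params):
    """Return the argv list for a fault type without executing it."""
    if fault_type == "cpu-stress":
        # The `stress` tool: N CPU workers for a bounded duration.
        return [
            "stress",
            "--cpu", str(params["workers"]),
            "--timeout", f"{params['seconds']}s",
        ]
    if fault_type == "network-delay":
        # Linux traffic control: add a netem delay on one interface.
        return [
            "tc", "qdisc", "add", "dev", params["device"],
            "root", "netem", "delay", f"{params['delay_ms']}ms",
        ]
    raise ValueError(f"unsupported fault type: {fault_type}")


cpu_cmd = build_fault_command("cpu-stress", {"workers": 2, "seconds": 30})
net_cmd = build_fault_command(
    "network-delay", {"device": "eth0", "delay_ms": 100}
)
```

Separating command construction from execution makes the mapping easy to check and leaves the choice of namespace (Pod or node) to the caller.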
Those skilled in the art can understand the application fault drilling method based on a Kubernetes cluster provided by the invention as a specific implementation of the application fault drilling system based on a Kubernetes cluster; that is, the system can be realized by executing the step flow of the method.
The embodiment of the invention provides an application fault drilling method and system based on a Kubernetes cluster, which can actively inject abnormal software or hardware fault states into an application under a certain load, simulate fault scenarios that the application may encounter in actual production operation, locate the factors that affect system stability, and improve the resilience of the distributed system.
The invention is a fault drilling system designed mainly for Kubernetes clusters. It reuses the Kubernetes infrastructure as far as possible to implement application fault drilling, for example: some fault types are defined using the Kubernetes operator extension mechanism and corresponding controllers are implemented; fault injection experiments are managed by category, a separate CRD (CustomResourceDefinition) is designed for each type of fault, and the ChaosController controller manages the full life cycle of fault injection. The information of multiple clusters is managed through Kubeconfig configuration files, which provides fault injection capability for multi-cluster applications. In this way, the difficulty of deployment and use is reduced.
The invention provides an active fault injection scheme based on Kubernetes. Different implementation components are designed for different types of fault injection according to the characteristics of actual fault injection and the applicability of each method, and data transfer between the components, the controllers and so on relies on Kubernetes' own capabilities. The invention implements a multi-dimensional explosion radius controller that controls the target experiment object through user-defined labels and network policies, supports fault injection in different directions by design with respect to the scope of impact after injection, and limits the scope of impact of the fault. The invention provides task workflow and experiment scheduling capabilities to achieve orderly injection simulation of multiple faults without resource preemption.
Those skilled in the art know that, besides implementing the system and its devices, modules, and units provided by the invention purely as computer-readable program code, the same functions can be achieved by logically programming the method steps so that the system and its devices, modules, and units take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system and its devices, modules, and units provided by the invention can therefore be regarded as a hardware component, and the devices, modules, and units it contains for realizing the various functions can be regarded as structures within that hardware component; the devices, modules, and units for realizing the various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. An application fault drilling method based on a Kubernetes cluster, characterized by comprising:
task flow controller TaskController: controlling the life cycle of the drilling task;
controller ChaosController: controlling the life cycle of the experiment, wherein a task object, once decomposed by the task flow controller TaskController, becomes units that the ChaosController can control;
component Agent: injecting the relevant abnormal disturbance into a specified application;
explosion radius control component ExplosionController: controlling the explosion radius of the experiment, wherein the current experiment object is selected using the native label-based selection capability of Kubernetes;
the implementation process of the method comprises the following steps:
step S1: submitting the chaos experiment data of the task orchestration taskYml to the API server, wherein the task flow controller TaskController processes the data as soon as it receives the event, and the experiment data is persisted through etcd;
step S2: the controller ChaosController parses the experiment object, selects the corresponding fault injection mode according to the operation type of the experiment object, and manages the life cycle of the experiment;
step S3: the simulated fault injection component Agent executes the predetermined logic according to the fault injection type.
2. The Kubernetes cluster-based application fault drilling method of claim 1, wherein the step S1 comprises:
step S1.1: the task flow controller TaskController parses a task object from the submitted task orchestration taskYml, persists the data of the task object, and determines the type of the task flow from the taskType in the metadata of the task object; step S1.2 is executed for a single-experiment task flow, and step S1.3 for a multi-experiment task flow;
step S1.2: encapsulating a single-experiment chaosObject according to the chaosTemplate of the task object and submitting it to the API server, whereupon the controller ChaosController watches the chaosObject, executes the specific experiment, and feeds back the experiment state through the Reconcile mechanism;
step S1.3: executing a multi-experiment task, and determining from the task type whether the task object is executed serially or in parallel;
step S1.4: defining the scheduled task type through the Schedule in the chaosTemplate, wherein a task executed once runs in Job mode and a task executed multiple times runs in CronJob mode.
3. The Kubernetes cluster-based application fault drilling method of claim 2, wherein the step S1.3 specifically comprises:
if the task is executed serially, the single-experiment chaosObjects parsed from the task object are placed in the queue TaskQueue and then taken from the queue and executed in order; the execution of each chaosObject is still controlled by the controller ChaosController, which feeds back the result; each time an experiment finishes, the task flow controller is notified by callback, and the TaskController caches the number of executed experiments and compares it with the total number in the task object to determine the execution state of the task;
if the task is executed in parallel, it is executed in the manner of step S1.2.
4. The Kubernetes cluster-based application fault drilling method of claim 1, wherein the step S2 comprises:
after capturing a chaosObject, the controller ChaosController parses the main data of the object and determines the injection mode from the value of chaosType;
for fault injection of the non-life-cycle-management class, an AgentObject is assembled from the taskTarget and action content, and the node on which the object resides is then determined from the information on the experiment's target object, so that the experiment data is sent to the specific Agent program; the Agent execution result is monitored through Reconcile, and a recovery instruction is sent to the Agent after the experiment finishes;
for experiment actions of the Pod life-cycle-management class, fault injection is performed by calling the Pod-related API of the API server; after injection, the completion state of the experiment is verified by monitoring Pod time, and the ChaosController then updates the state of the task, completing one run of the task.
5. The Kubernetes cluster-based application fault drilling method of claim 1, wherein the step S3 comprises:
step S3.1: the Agent parses the AgentObject to obtain the experiment data, analyzes the experiment data to determine whether the experiment targets the Pod or the node on which the Pod resides, and then carries out the specific experiment;
step S3.2: the Agent executes the corresponding logic according to the submitted experiment type;
step S3.3: when the duration of the fault injection stage ends, the Agent's chaos injection program is destroyed, the namespaces of the application Pod into which the fault was injected are exited, and the experiment result is then sent to the controller ChaosController.
6. The Kubernetes cluster-based application fault drilling method of claim 5, wherein the step S3.1 further comprises: if the experiment targets the Pod, the Agent's attachNS() function enters the namespaces of the Pod to operate on the experiment object; if the experiment targets the node, the experiment injection is completed directly on the node.
7. The Kubernetes cluster-based application fault drilling method of claim 5, wherein the step S3.2 further comprises: calling the injectFault() function to inject the experiment according to the experiment object and the experiment type, then returning the result through a callback, wherein the injection mode of the injectFault() function is determined by the type.
8. An application fault drilling system based on a Kubernetes cluster, comprising:
task flow controller TaskController: controlling the life cycle of the drilling task;
controller ChaosController: controlling the life cycle of the experiment, wherein a task object, once decomposed by the task flow controller TaskController, becomes units controlled by the ChaosController;
component Agent: injecting the relevant abnormal disturbance into a specified application;
explosion radius control component ExplosionController: controlling the explosion radius of the experiment, wherein the current experiment object is selected using the native label-based selection capability of Kubernetes;
the system comprises:
module M1: submitting the chaos experiment data of the task orchestration taskYml to the API server, wherein the task flow controller TaskController processes the data as soon as its watch receives the event, and the experiment data is persisted through etcd;
module M2: the controller ChaosController parses the experiment object, selects the corresponding fault injection mode according to the operation type of the experiment object, and manages the life cycle of the experiment;
module M3: the simulated fault injection component Agent executes the predetermined logic according to the fault injection type.
9. The Kubernetes cluster-based application fault drilling system of claim 8, wherein the module M1 comprises:
module M1.1: the task flow controller TaskController parses a task object from the submitted task orchestration taskYml, persists the data of the task object, and determines the type of the task flow from the taskType in the metadata of the task object; module M1.2 is executed for a single-experiment task flow, and module M1.3 for a multi-experiment task flow;
module M1.2: encapsulating a single-experiment chaosObject according to the chaosTemplate of the task object and submitting it to the API server, whereupon the controller ChaosController watches the chaosObject, executes the specific experiment, and feeds back the experiment state through the Reconcile mechanism;
module M1.3: executing the multi-experiment task, and determining from the task type whether the task object is executed serially or in parallel;
module M1.4: defining the scheduled task type through the Schedule in the chaosTemplate, wherein a task executed once runs in Job mode and a task executed multiple times runs in CronJob mode;
the module M1.3 specifically comprises:
if the task is executed serially, the single-experiment chaosObjects parsed from the task object are placed in the queue TaskQueue and then taken from the queue and executed in order; the execution of each chaosObject is still controlled by the controller ChaosController, which feeds back the result; each time an experiment finishes, the task flow controller is notified by callback, and the TaskController caches the number of executed experiments and compares it with the total number in the task object to determine the execution state of the task;
if the task is executed in parallel, it is executed using the module M1.2.
10. The Kubernetes cluster-based application fault drilling system of claim 8, wherein the module M2 comprises:
after capturing a chaosObject, the controller ChaosController parses the main data of the object and determines the injection mode from the value of chaosType;
for fault injection of the non-life-cycle-management class, an AgentObject is assembled from the taskTarget and action content, and the node on which the object resides is then determined from the information on the experiment's target object, so that the experiment data is sent to the specific Agent program; the Agent execution result is monitored through Reconcile, and a recovery instruction is sent to the Agent after the experiment finishes;
for experiment actions of the Pod life-cycle-management class, fault injection is performed by calling the Pod-related API of the API server; after injection, the completion state of the experiment is verified by monitoring Pod time, and the ChaosController then updates the state of the task, completing one run of the task;
the module M3 comprises:
module M3.1: the Agent parses the AgentObject to obtain the experiment data, analyzes the experiment data to determine whether the experiment targets the Pod or the node on which the Pod resides, and then carries out the specific experiment;
module M3.2: the Agent executes the corresponding logic according to the submitted experiment type;
module M3.3: when the duration of the fault injection stage ends, the Agent's chaos injection program is destroyed, the namespaces of the application Pod into which the fault was injected are exited, and the experiment result is then sent to the controller ChaosController;
the module M3.1 further comprises: if the experiment targets the Pod, the Agent's attachNS() function enters the namespaces of the Pod to operate on the experiment object; if the experiment targets the node, the experiment injection is completed directly on the node;
the module M3.2 further comprises: calling the injectFault() function to inject the experiment according to the experiment object and the experiment type, then returning the result through a callback, wherein the injection mode of the injectFault() function is determined by the type.
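
The serial execution path recited in claims 3 and 9 (queue the experiments, run them in order, count completions via callback, compare with the total) can be sketched in Python as below. This is an in-memory illustration only: the `TaskFlow` class, its method names, the state strings, and the stand-in `fake_chaos_controller` are assumptions for the example, not the claimed implementation.

```python
from collections import deque
from typing import Callable, List


class TaskFlow:
    """Hypothetical in-memory model of the serial branch of the TaskController."""

    def __init__(self, experiments: List[str]) -> None:
        self.queue = deque(experiments)  # plays the role of TaskQueue
        self.total = len(experiments)
        self.done = 0                    # cached count of finished experiments
        self.order: List[str] = []

    def on_finished(self, name: str) -> None:
        """Callback invoked each time one experiment finishes."""
        self.done += 1
        self.order.append(name)

    @property
    def state(self) -> str:
        # Compare the finished count with the total to derive the task state.
        return "Finished" if self.done == self.total else "Running"

    def run_serial(self, execute: Callable[[str, Callable[[str], None]], None]) -> None:
        """Take experiments from the queue one at a time and execute them in order."""
        while self.queue:
            name = self.queue.popleft()
            execute(name, self.on_finished)


def fake_chaos_controller(name: str, callback: Callable[[str], None]) -> None:
    """Stand-in for the ChaosController: 'runs' the experiment, then calls back."""
    callback(name)


flow = TaskFlow(["cpu-stress", "network-delay", "pod-kill"])
print(flow.state)
flow.run_serial(fake_chaos_controller)
print(flow.state, flow.order)
```

Because the next experiment is only taken from the queue after the previous callback has fired, two experiments never hold the same target at once, which matches the stated goal of ordered injection without resource preemption.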
CN202211365891.1A 2022-10-27 2022-10-31 Application fault drilling method and system based on kubernetes cluster Pending CN115686913A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022113256650 2022-10-27
CN202211325665 2022-10-27

Publications (1)

Publication Number Publication Date
CN115686913A (en) 2023-02-03

Family

ID=85047617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211365891.1A Pending CN115686913A (en) 2022-10-27 2022-10-31 Application fault drilling method and system based on kubernetes cluster

Country Status (1)

Country Link
CN (1) CN115686913A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination