CN113010393A - Fault drilling method and device based on chaotic engineering - Google Patents


Info

Publication number
CN113010393A
Authority
CN
China
Prior art keywords
fault
test
result data
resource map
performance analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110215213.6A
Other languages
Chinese (zh)
Inventor
郑叔亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Star Times Software Technology Co ltd
Original Assignee
Beijing Star Times Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Star Times Software Technology Co ltd filed Critical Beijing Star Times Software Technology Co ltd
Priority to CN202110215213.6A priority Critical patent/CN113010393A/en
Publication of CN113010393A publication Critical patent/CN113010393A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457Performance evaluation by simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447Performance evaluation by modeling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3664Environments for testing or debugging software
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3688Test management for test execution, e.g. scheduling of test suites
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3696Methods or tools to render software testable
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/08Computing arrangements based on specific mathematical models using chaos models or non-linear system models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Nonlinear Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault drilling method and device based on chaos engineering, relating to the field of internet technology. The method can analyze and model large-scale distributed system scenarios, integrate them into a chaos engineering method, and effectively handle complex fault drilling scenarios. The method comprises the following steps: configuring a resource map corresponding to the structure of a target system, and marking test points in the resource map; formulating a fault drilling execution plan based on the resource map and the test points, and constructing an automated test program; running the automated test program and executing fault drilling in the target system to obtain monitoring result data of the test points; constructing a performance analysis model according to the resource map and the test points, and updating parameters of the performance analysis model by using the monitoring result data; and simulating and outputting fault points in the resource map based on the updated performance analysis model. The device applies the method provided by this scheme.

Description

Fault drilling method and device based on chaotic engineering
Technical Field
The invention relates to the technical field of internet, in particular to a fault drilling method and device based on chaotic engineering.
Background
Distributed systems are widely adopted in enterprise application systems, Internet of Things systems and large-scale cloud platform systems. A distributed system naturally contains a large number of interaction and dependency points, so the number of places where errors can occur is very large. Hard disk failures, network failures, traffic surges on particular components and similar events may, if handled poorly, lead to service stalls, reduced system performance, or other unexpected abnormal behavior. In a complex distributed system, failures cannot be completely prevented by human effort alone; instead, effort should be made to identify as many fragile, failure-prone links as possible before the anomalies are triggered. Once these risks are identified, system engineers can reinforce the system and take preventive measures in a targeted manner, avoiding the serious consequences of a fault. Chaos engineering is a methodology for actively finding weak links in a system by performing experiments on the system infrastructure. Through experimental verification it helps build a more resilient system and gives system engineers a more thorough understanding of the system's behavior at run time.
The chaos engineering method has proved highly valuable in practice at a few large internet companies, such as Netflix and Alibaba. It is an experience-driven method, and no unified process or technical standard has yet been formed. Therefore, any technology (model, algorithm, architecture and the like) that follows its basic principles and meets its goals can serve as a way to improve the efficiency of chaos engineering.
In addition, among the technical solutions disclosed so far, few patents relate to chaos engineering. One example is "fault drilling method, device, equipment and computer storage medium" (application No. 201910570965.7), which describes a fault drilling process and a drilling data processing method for a financial system, and in which "chaos engineering tools" are used to perform experiments. The chaos engineering tool there is only an agent tool that can automatically start and stop a software program and collect its run data. Conceptually, the scheme as a whole is a chaos engineering practice scheme. A similar scheme is "a fault drilling method, device and system" (application No. 201811516432.2). The ideas of these schemes are similar: a configuration part deploys a fault drilling container, a fault drilling tool executes the fault drilling, and a result is finally obtained. Their common problem is that the functional logic of the chaos engineering tool is simple, basically limited to starting, stopping, recovering and shutting down software programs and services, so it is unlikely to cope with complex fault scenarios such as network jitter, access attacks, hardware damage and network partitioning.
Disclosure of Invention
The invention aims to provide a chaos engineering-based fault drilling method and a chaos engineering-based fault drilling device, which can analyze and model a large-scale distributed system scene, integrate the large-scale distributed system scene into a chaos engineering method and effectively cope with a complex fault drilling scene.
In order to achieve the above object, a first aspect of the present invention provides a method for performing fault drilling based on chaotic engineering, including:
configuring a resource map corresponding to a target system structure, and marking test points in the resource map;
formulating a fault drilling execution plan based on the resource map and the test points, and constructing an automatic test program;
running an automatic test program, and executing fault drilling in a target system to obtain monitoring result data of the test point;
constructing a performance analysis model according to the resource map and the test points, and updating parameters of the performance analysis model by using the monitoring result data;
and simulating and outputting fault points in the resource map based on the updated performance analysis model.
Preferably, configuring the resource map corresponding to the target system structure further includes:
classifying and defining fault points, and constructing an event information structure model when the fault points occur;
when a fault event occurs in a test point, describing fault point information by adopting a uniform data format based on the event information structure model.
Preferably, the method for establishing the fault drilling execution plan based on the resource map and the test points comprises the following steps:
marking the test points for fault drilling and the correlation among the test points in the resource map;
instantiating the test points and configuring test tools corresponding to the test points to obtain at least one test task;
organizing the test tasks into a test flow according to a logical relation, and configuring scheduling time when the flow is executed;
and generating an execution plan of the fault drilling according to the test flow and the scheduling time, and constructing an automatic test program.
Preferably, the method for constructing a performance analysis model according to the resource map and the test points comprises the following steps:
and training a performance analysis model by adopting a queuing network model according to the resource map and the test points.
Preferably, the method for updating parameters of the performance analysis model by using the monitoring result data includes:
inputting loads to a target system to implement fault drilling, and simultaneously acquiring monitoring result data of each test point through a test tool and sequentially reading the monitoring result data according to a time sequence;
carrying out uniform data format conversion and storage on the monitoring result data which are read in sequence according to an event information structure model;
judging whether a fault point exists in the current fault drilling result or not according to the fault threshold definition corresponding to each test point;
inputting the load into the performance analysis model to obtain simulation result data corresponding to the monitoring result data one to one;
and comparing the monitoring result data with the simulation result data, and updating the configuration parameters of the performance analysis model when the difference exceeds a threshold range, until an optimal performance analysis model is obtained.
Further, the method for simulating and outputting the fault point in the resource map based on the updated performance analysis model comprises the following steps:
and based on the optimal performance analysis model, simulating, analyzing and predicting potential fault points in the resource map when inputting new loads.
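The compare-and-update steps above can be sketched as follows. This is a minimal illustration only, not the patented implementation: it assumes a single test point modeled as an M/M/1 queue (the simplest building block of a queuing network), whose mean response time is R = 1 / (mu - lambda). The function names, the 5% tolerance and all numbers are hypothetical.

```python
def simulate_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue: R = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # saturated queue
    return 1.0 / (service_rate - arrival_rate)


def update_service_rate(service_rate, observations, tolerance=0.05):
    """One calibration step after a drill round.

    observations: list of (arrival_rate, measured_response_time) pairs
    taken from the monitoring result data of one test point.
    The model parameter is only changed when the simulated and measured
    values differ by more than `tolerance` (relative difference).
    """
    worst = max(
        abs(simulate_response_time(lam, service_rate) - measured) / measured
        for lam, measured in observations
    )
    if worst <= tolerance:
        return service_rate  # model already matches the drill data
    # Re-estimate mu from each observation (mu = lambda + 1/R) and average.
    estimates = [lam + 1.0 / measured for lam, measured in observations]
    return sum(estimates) / len(estimates)


def predicts_fault(arrival_rate, service_rate, fault_threshold_s):
    """Flag a potential fault point when the simulated response time
    for a new load exceeds the fault threshold of the test point."""
    return simulate_response_time(arrival_rate, service_rate) > fault_threshold_s


# Hypothetical drill data: (requests per second, measured seconds per request).
drill_data = [(50.0, 0.02), (80.0, 0.05)]
mu = update_service_rate(150.0, drill_data)  # initial guess of 150 req/s is corrected
print(round(mu, 3), predicts_fault(95.0, mu, fault_threshold_s=0.1))
```

In a real queuing network each node of the resource map would carry its own parameters, and the update rule would typically be a proper fitting procedure rather than a simple average.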
Compared with the prior art, the fault drilling method based on the chaotic engineering provided by the invention has the following beneficial effects:
In the chaos engineering based fault drilling method provided by the invention, the network, computing and storage resources required by the software and hardware modules of a target system at run time, together with the correlations between those modules, are quantified and abstracted to construct a resource map corresponding to the structure of the target system. Test points are marked in the resource map, and test tools are deployed to detect the test points at which faults occur in the target system, namely the fault points. A fault drilling execution plan is then formulated according to the resource map and the test points, and an automated test program for drilling the target system is constructed. When the automated test program runs, fault drilling is executed in the target system and monitoring result data of the test points is obtained. In addition, a performance analysis model is constructed according to the resource map and the test points, and its parameters are updated with the monitoring result data actually measured during fault drilling until an optimal performance analysis model is obtained. Finally, the performance analysis model is used to simulate potential fault points in the resource map, thereby realizing the function of predicting fault points in the target system.
Therefore, by applying the scheme of the invention, scenarios in which faults may occur in the target system can be predicted effectively, so that targeted preventive measures can be taken in advance and the stability of the system improved. Moreover, because the performance analysis model is trained and integrated into the chaos engineering method, the method is better suited to the complex scenarios of large-scale distributed systems.
A second aspect of the present invention provides a chaotic engineering based fault drilling device, to which the chaotic engineering based fault drilling method described in the above technical solution is applied, the device including:
the configuration unit is used for configuring a resource map corresponding to a target system structure and marking test points in the resource map;
the planning unit is used for making a fault drilling execution plan based on the resource map and the test points and constructing an automatic test program;
the fault drilling unit is used for running an automatic test program and executing fault drilling in a target system to obtain monitoring result data of the test point;
the model updating unit is used for constructing a performance analysis model according to the resource map and the test points and updating parameters of the performance analysis model by using the monitoring result data;
and the fault prediction unit is used for simulating and outputting fault points in the resource map based on the updated performance analysis model.
Preferably, the device further comprises:
the classification definition unit is used for classifying and defining the fault points and constructing an event information structure model when the fault points occur;
and the data conversion unit is used for describing fault point information by adopting a uniform data format based on the event information structure model when a fault event occurs in the test point.
Preferably, the device further comprises:
the load input unit is used for inputting a load to a target system to implement fault drilling, and simultaneously, the monitoring result data of each test point is collected through a test tool and is sequentially read according to the time sequence;
the data reading unit is used for carrying out uniform data format conversion and storage on the sequentially read monitoring result data according to the event information structure model;
the fault judging unit is used for judging whether a fault point exists in the current fault drilling result according to the fault threshold definition corresponding to each test point;
the simulation output unit is used for inputting the load into the performance analysis model to obtain simulation result data corresponding to the monitoring result data one by one;
and the model updating unit is used for comparing the monitoring result data with the simulation result data and updating the configuration parameters of the performance analysis model when the difference exceeds a threshold range, until an optimal performance analysis model is obtained.
Compared with the prior art, the beneficial effects of the chaos engineering based fault drilling device provided by the invention are the same as the beneficial effects of the chaos engineering based fault drilling method provided by the technical scheme, and are not repeated herein.
A third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the chaos engineering-based fault drilling method.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the invention are the same as the beneficial effects of the chaos engineering-based fault drilling method provided by the technical scheme, and are not repeated herein.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a first schematic flow chart of a chaos engineering-based fault drilling method according to an embodiment of the present invention;
FIG. 2 is a second schematic flow chart of the chaos engineering-based fault drilling method according to the embodiment of the present invention;
FIG. 3 is a data dimension diagram of an event information structure model according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a resource graph in an embodiment of the invention;
FIG. 5 is a diagram illustrating an exemplary model of a fault drilling execution plan in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a process for analyzing and predicting performance behavior of a target system according to an embodiment of the present invention;
FIG. 7 is a schematic view of an iteration flow of fault drilling according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to FIG. 1 and FIG. 2, the present embodiment provides a fault drilling method based on chaos engineering, including:
configuring a resource map corresponding to the structure of a target system, and marking test points in the resource map; formulating a fault drilling execution plan based on the resource map and the test points, and constructing an automated test program; running the automated test program and executing fault drilling in the target system to obtain monitoring result data of the test points; constructing a performance analysis model according to the resource map and the test points, and updating parameters of the performance analysis model by using the monitoring result data; and simulating and outputting fault points in the resource map based on the updated performance analysis model.
In the chaos engineering-based fault drilling method provided in this embodiment, the network, computing and storage resources required by the software and hardware modules of the target system at run time, together with the correlations between those modules, are quantified and abstracted to construct a resource map corresponding to the structure of the target system. Test points are marked in the resource map, and test tools are deployed to detect the test points at which faults occur in the target system, namely the fault points. A fault drilling execution plan is then formulated according to the resource map and the test points, and an automated test program for drilling the target system is constructed. When the automated test program runs, fault drilling is executed in the target system and monitoring result data of the test points is obtained. In addition, a performance analysis model is constructed according to the resource map and the test points, and its parameters are updated with the monitoring result data actually measured during fault drilling until an optimal performance analysis model is obtained. Finally, the performance analysis model is used to simulate potential fault points in the resource map, thereby realizing the function of predicting fault points in the target system. Therefore, by applying the scheme of this embodiment, scenarios in which faults may occur in the target system can be predicted effectively, so that targeted preventive measures can be taken in advance and the stability of the system improved. Moreover, because the performance analysis model is trained and integrated into the chaos engineering method, the method is better suited to the complex scenarios of large-scale distributed systems.
The objective of this embodiment is to introduce a quantitative performance behavior model and a matching big data analysis method into the core of chaos engineering, forming a data-driven chaos engineering method, and to design a chaos engineering tool set on this basis to facilitate implementation of the scheme. Specifically, this comprises the following points:
1. methods for measuring, modeling and analyzing system performance are added to the chaos engineering method of this embodiment, combining the empirical method of chaos engineering with a quantitative model to achieve more accurate measurement and adaptive optimization;
2. the data of each round of fault drilling is analyzed and mined to discover potential fault patterns and predict fault probabilities, which drives system upgrade and optimization as well as a new round of fault drilling;
3. the scenarios of a large-scale distributed system are analyzed and modeled and integrated into the chaos engineering method of this embodiment, so that complex scenarios can be configured, managed and automatically scheduled, and complex problems can be handled effectively;
4. the chaos engineering method is enhanced with continuous integration, execution and optimization capabilities and is fused with mainstream software development processes, achieving a significant improvement in efficiency and effect.
In the above embodiment, configuring the resource map corresponding to the target system structure further includes:
classifying and defining fault points, and constructing an event information structure model when the fault points occur; when a fault event occurs in a test point, describing fault point information by adopting a uniform data format based on an event information structure model.
In specific implementation, the fault point can be regarded as a target measured, predicted and optimized by a chaotic engineering experiment. Illustratively, the classification definition of the failure point is as shown in the following table 1:
TABLE 1 (classification definitions of fault points; provided as an image in the original publication)
As shown in Table 1, when any test point in the target system fails, that is, when a fault point is detected during a chaos engineering experiment, a corresponding event is triggered. The event data needs to contain the relevant context information. For example, the data of the event "network delay exceeds DN ms" contains, in addition to the DN value, information such as the time at which the event occurred and the identities or IP addresses of the two communication endpoints. In this embodiment, a general event information structure model is designed to express events and their related information uniformly. As shown in FIG. 3, the event information structure model includes the following seven dimensions of information about an event:
event classification: identifies the specific category of the current event, such as a primary category and a secondary category. Event classifications can form a tree-shaped parent-child hierarchy and can also contain combination relationships. Each event classification entity has at least a globally unique fixed code and a label name that supports multiple languages;
subject: represents the active participant in an event, such as a system identifier, container identifier, software identifier, process ID and related information;
object: represents the passive participant in an event, mainly the specific physical or logical object at which the fault point occurs and its behavior. For example, a CPU, memory or disk is a physical object, while the state of a process or the quality of service is a logical object; the behavior refers to specific detected values, such as BRR and BWR;
time: the time at which the event occurred, represented by a timestamp. The time is generally accurate to the second; for fine-grained big data (such as system performance measurement and monitoring) it needs to be accurate to the nanosecond. In addition, to support international use, the time zone in which the event occurred also needs to be recorded;
space: the spatial information of the event, used to locate the logical and/or physical position at which the event occurred, which facilitates data management and tracing. The spatial data may be organized as a multi-level grid to suit different big data scenarios, and can even interface with a professional GIS system if necessary;
contact point: records the interface through which the experiment platform was in contact with the subject when the event occurred, that is, the position and means by which the event was detected;
additional data: supplemental data specific to different event categories.
Through the seven dimensions of the event information structure model, fault point information can be described comprehensively and used as the basic data of the whole chaos engineering method.
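As an illustration of the seven dimensions listed above, the event "network delay exceeds DN ms" could be represented by a record like the one below. This is a sketch under stated assumptions: the class name, field names, the category code "NET.DELAY_EXCEEDED", the probe name and the addresses are all hypothetical, since the patent does not fix a concrete data format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FaultEvent:
    """One record with the seven dimensions; all field names are hypothetical."""
    classification: str   # globally unique category code
    subject: dict         # active participant (system, container, process ID, ...)
    obj: dict             # passive participant and its detected values
    time: dict            # nanosecond timestamp plus time zone
    space: dict           # logical and/or physical location of the event
    contact_point: dict   # where and how the platform detected the event
    extra: dict = field(default_factory=dict)  # category-specific additional data

    def to_json(self) -> str:
        """Serialize to a uniform, storable format."""
        return json.dumps(asdict(self), sort_keys=True)


def network_delay_event(src_ip, dst_ip, dn_ms, measured_ms, time_ns):
    """Build the event 'network delay exceeds DN ms' for two endpoints."""
    return FaultEvent(
        classification="NET.DELAY_EXCEEDED",          # hypothetical code
        subject={"endpoint_ip": src_ip},
        obj={"endpoint_ip": dst_ip, "measured_ms": measured_ms},
        time={"ns": time_ns, "tz": "UTC"},
        space={"zone": "rack-1"},                     # hypothetical location
        contact_point={"tool": "ping-probe"},         # hypothetical test tool
        extra={"DN_ms": dn_ms},
    )


event = network_delay_event(
    "192.0.2.10", "192.0.2.20", dn_ms=200, measured_ms=350.0,
    time_ns=1_700_000_000_000_000_000,
)
print(event.to_json())
```

Keeping every event in one structure means that monitoring data from different test tools can be stored and compared without per-tool parsing.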
The resource map quantitatively describes the physical or virtualized resource components on which the execution of chaos engineering experiments depends, and the relationships between those components. The measurement, prediction and optimization of a fault point cannot be separated from the operating environment in which it is located. To a certain extent, the operating environment is a set of resource constraints. Specifically, in a chaos engineering scenario, any fault drilling experimental environment requires one or more sets of resource allocations. The resource map can express the capacity information of resources and the dependency and restriction relationships between resources, and can support flexible, dynamic allocation of the fault drilling experimental environment.
The resources are morphologically divided into physical and virtualized resources, and functionally divided into computing resources, storage space resources, I/O resources, and network resources. Computing resources typically include a CPU, GPU, or dedicated computing card; the storage space resources comprise temporary storage resources used by matching with the computing resources, such as cache, memory, video memory and the like, and persistent storage resources, such as hard disks; I/O resources, i.e., resources represented by input/output devices or interfaces, such as network cards, PCIe, SATA, USB, etc.; network resources are resources such as switches, routers, etc. that determine network bandwidth and data switching capabilities.
The resource graph describes resource information and interrelationships through a graph data model, wherein the graph data model comprises nodes and edges, the nodes are entity objects, and each entity object comprises a category label and attribute data. An edge is a node which directionally connects two nodes and represents the relationship between the two nodes, and the edge data also comprises a category label and attribute data. The graph data model has the advantages that the attribute information of the entities and the relations can be dynamically adjusted and upgraded, the relations between the entities can be dynamically adjusted, and rich graph algorithms can perform efficient retrieval of entity object information, mining of implicit relations and the like.
The correlation of resources is a logical abstraction reflecting the relationship between modules in a real system. For example, there are 100 CPUs, 6400GB memory and 500TB storage in a pool of uniformly schedulable physical resources, and these resources are actually distributed on an average over 50 machines, so that there is no possibility of performance of the same level between any two resources, and the performance of communication between the CPUs and storage across the server is certainly much lower than that of the CPUs and storage within a single service. Even within a single server, the bandwidth between different CPUs and the memory of different slots may vary. For another example, the performance of a virtualized resource is largely determined by the performance of the underlying physical resource, so that the performance of two virtualized resources (e.g. virtual CPUs with 4 cores) with equivalent specifications may also be different. Therefore, it is necessary to express such detailed but important information by defining more subdivided resource relationships. Knowledge maps are well suited to address this problem.
In addition, the dependency graph of the resources is also an abstraction of the computer software and hardware system architecture. For example, in a simple server cluster system, a plurality of servers are interconnected through a switch, and each server hosts a plurality of virtualization containers. This system can be expressed by a resource map as shown in fig. 4, which includes physical server nodes, switch nodes, and virtualized container nodes. The relationship between a physical server node and a switch node is "physical direct connection"; the relationship between two physical server nodes is "network interconnection"; the relationship between a virtualized container node and the physical server node on which it runs is "lodging" (hosting); and the relationship between two virtualized container nodes can be subdivided into "symbiotic interconnection", that is, deployment on the same physical server, and "cross-machine interconnection", that is, deployment on different physical servers.
By way of example, table 2 lists node type definitions for the resource graph, and table 3 lists relationship definitions between nodes:
[Table 2, node type definitions: reproduced only as an image in the original publication]
TABLE 2
Relationship classification | Main label | Attribute information
Physical direct connection | R_DIRECT_CON | ID of two connected endpoints
Network interconnection | R_NET_CON | ID of two connected endpoints
Lodging | R_HOSTING | Virtual resource ID, physical resource ID
Symbiotic interconnect | R_BROTHER_CON | ID of two connected endpoints
Cross-machine interconnection | R_CROSS_CON | ID of two connected endpoints
Included | R_CONTAIN | ID of two connected endpoints
Binding | R_BIND | ID of two connected endpoints
TABLE 3
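As an illustration only, the server cluster example and the relationship labels of table 3 can be sketched as a small in-memory graph. The relationship labels (R_DIRECT_CON, R_NET_CON, R_HOSTING, R_BROTHER_CON, R_CROSS_CON) come from table 3; the node labels SWITCH, SERVER and CONTAINER, the class names and the node IDs are assumptions made for this sketch, since the node type definitions of table 2 are not reproduced here.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Node:
    node_id: str
    label: str            # category label (hypothetical values, table 2 is not available)
    attrs: tuple = ()     # attribute data stored as sorted (key, value) pairs


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    label: str            # main label taken from table 3, e.g. "R_HOSTING"


@dataclass
class ResourceMap:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, label, **attrs):
        self.nodes[node_id] = Node(node_id, label, tuple(sorted(attrs.items())))

    def add_edge(self, src, dst, label):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before they are connected")
        self.edges.append(Edge(src, dst, label))

    def host_of(self, container_id):
        # R_HOSTING edges run from the virtual resource to the physical resource.
        for edge in self.edges:
            if edge.label == "R_HOSTING" and edge.src == container_id:
                return edge.dst
        return None

    def link_containers(self):
        # Same host -> symbiotic interconnect, different hosts -> cross-machine.
        containers = sorted(n.node_id for n in self.nodes.values() if n.label == "CONTAINER")
        for a, b in combinations(containers, 2):
            same_host = self.host_of(a) == self.host_of(b)
            self.add_edge(a, b, "R_BROTHER_CON" if same_host else "R_CROSS_CON")


# Two servers behind one switch, three containers (all IDs are made up).
rmap = ResourceMap()
rmap.add_node("sw1", "SWITCH")
for server in ("srv1", "srv2"):
    rmap.add_node(server, "SERVER", cpus=2)
    rmap.add_edge(server, "sw1", "R_DIRECT_CON")
rmap.add_edge("srv1", "srv2", "R_NET_CON")
for container, host in (("c1", "srv1"), ("c2", "srv1"), ("c3", "srv2")):
    rmap.add_node(container, "CONTAINER")
    rmap.add_edge(container, host, "R_HOSTING")
rmap.link_containers()
```

In this sketch the container-to-container relationships are derived from the hosting edges rather than entered by hand, which is one possible way to keep the finer-grained relationships consistent with the deployment.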
In the above embodiment, the method for making the fault drilling execution plan based on the resource map and the test points and constructing the automatic test program includes:
marking test points for fault drilling and the mutual relation among the test points in a resource map; instantiating test points and configuring test tools corresponding to the test points to obtain at least one test task; organizing the test tasks into a test flow according to a logical relation, and configuring scheduling time when the flow is executed; and generating an execution plan of the fault drilling according to the test flow and the scheduling time, and constructing an automatic test program.
In a specific implementation, the fault drilling execution plan comprises the definition of a task flow and the scheduling of its execution time. Based on the resource map, a reasonable task flow is defined with the test points of interest as targets, and a run schedule for the task flow is then set according to the overall development and go-live arrangements of the system. The specific process is as follows:
1. marking test points for fault drilling and the mutual relation among the test points in a resource map library;
2. instantiating a target test point, associating the target test point with the resource, and configuring a test tool corresponding to each test point;
3. organizing the test tasks of the fault drilling into a flow according to a logical relationship;
4. configuring a time scheduling scheme for the flow, such as a fixed time point, a periodic interval, or a cron time expression;
5. initializing the related drilling tools and waiting for the fault drilling plan to be executed.
It can be appreciated that the model of the fault drill execution plan is shown in fig. 5:
an execution flow is organized by a plurality of tasks according to a workflow mode and is configured with a time scheme for scheduling flow execution;
each task is associated with a target fault point instance and with the tool used to perform the drill;
initializing the drilling tool requires target test point information and resource information as parameters for configuration;
the fault point instance needs to be associated on a certain resource to have practical significance.
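The plan model described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: the class names, tool names, task IDs and the cron expression are all assumptions, and the logical relation between tasks is reduced to simple "depends on" links ordered with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class FaultPointInstance:
    name: str
    resource_id: str      # a fault point only has meaning on a concrete resource


@dataclass
class DrillTask:
    task_id: str
    fault_point: FaultPointInstance
    tool: str             # name of the drilling tool (hypothetical)
    depends_on: tuple = ()

    def tool_config(self):
        # Tool initialisation takes test point and resource information as parameters.
        return {
            "tool": self.tool,
            "target": self.fault_point.name,
            "resource": self.fault_point.resource_id,
        }


@dataclass
class ExecutionPlan:
    schedule: str         # time scheme, here a cron expression
    tasks: dict = field(default_factory=dict)

    def add(self, task):
        self.tasks[task.task_id] = task

    def ordered_tasks(self):
        # Map each task to its predecessors and emit predecessors first.
        graph = {t.task_id: set(t.depends_on) for t in self.tasks.values()}
        return [self.tasks[tid] for tid in TopologicalSorter(graph).static_order()]


# "0 2 * * 6" means 02:00 every Saturday in standard five-field cron syntax.
plan = ExecutionPlan(schedule="0 2 * * 6")
plan.add(DrillTask("verify_recovery", FaultPointInstance("container_restart", "c1"),
                   tool="probe_tool", depends_on=("kill_container",)))
plan.add(DrillTask("inject_delay", FaultPointInstance("network_delay", "sw1"),
                   tool="delay_tool"))
plan.add(DrillTask("kill_container", FaultPointInstance("container_crash", "c1"),
                   tool="kill_tool", depends_on=("inject_delay",)))

execution_order = [t.task_id for t in plan.ordered_tasks()]
```

Here `execution_order` lists the tasks so that every task appears after the tasks it depends on, regardless of the order in which they were added to the plan.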
In the above embodiment, the method for constructing the performance analysis model according to the resource map and the test points includes: and training a performance analysis model by adopting a queuing network model according to the resource map and the test points.
In a specific implementation, the classical queuing network model and its variants are important tools for the macroscopic analysis of computer software and hardware systems. Different local software and hardware modules can be modeled as different queuing subnetworks according to their behavior characteristics, and the local modules are combined to form a queuing network of the complete system. Because queuing networks have a complete mathematical model and analysis method, a set of mathematical models can be constructed to serve as the performance analysis model of the system. This embodiment does not improve the performance analysis model or its training process; it adopts an existing, mature queuing network performance model as part of the supporting scheme.
With respect to queuing network models, the theory of system performance modeling and analysis can be found in the book Performance Modeling and Design of Computer Systems: Queueing Theory in Action, by Mor Harchol-Balter of Carnegie Mellon University, published by Cambridge University Press.
Through the performance analysis model, the delay and concurrent throughput of the system can be estimated under a given input assumption (usually that requests arrive according to a certain distribution, such as a Poisson distribution). The structure of the model is derived from the node relationships of the resource map, and some parameters of the model are derived from the nominal or measured performance of the resources, such as the time cost or per-unit-time throughput of a processor handling a specific type of task.
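As a hedged illustration of such a model, the sketch below analyses an open network of independent M/M/1 stations (a Jackson-type network) using the standard textbook formulas: utilisation rho = lambda / mu, mean jobs per station rho / (1 - rho), and Little's law for the end-to-end mean response time. The station names, service rates and visit ratios are invented for the example; the embodiment does not prescribe this particular model.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    service_rate: float   # mu, requests per second the resource can serve
    visits: float = 1.0   # mean number of visits to this station per request


def analyse(stations, arrival_rate):
    """Analyse an open network of M/M/1 stations under Poisson arrivals."""
    per_station = {}
    jobs_in_system = 0.0
    for s in stations:
        station_rate = arrival_rate * s.visits      # lambda_i
        rho = station_rate / s.service_rate         # utilisation
        if rho >= 1.0:
            raise ValueError(f"{s.name} is saturated (utilisation {rho:.2f})")
        mean_jobs = rho / (1.0 - rho)               # M/M/1 mean number in station
        per_station[s.name] = {"utilisation": rho, "mean_jobs": mean_jobs}
        jobs_in_system += mean_jobs
    return {
        "stations": per_station,
        # Little's law: mean response time = mean jobs in system / arrival rate
        "mean_response_time": jobs_in_system / arrival_rate,
        # The bottleneck station bounds the sustainable throughput.
        "max_throughput": min(s.service_rate / s.visits for s in stations),
    }


# Hypothetical stations derived from resource map nodes.
stations = [
    Station("srv1_cpu", service_rate=100.0),
    Station("sw1_network", service_rate=400.0, visits=2.0),
    Station("srv1_disk", service_rate=50.0, visits=0.5),
]
result = analyse(stations, arrival_rate=40.0)
```

The structure (which stations exist and how often a request visits them) would come from the resource map, while the service rates would come from nominal or measured resource performance.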
In the embodiment, the automatic test program is run and fault drilling is executed in the target system to obtain the monitoring result data of the test points. This is the execution stage of the fault drilling: the fault drilling plan is executed according to its time scheme. Technically, a flow instance of the fault drilling execution plan is created and then scheduled for execution at the given time point. While the flow executes, each task instance calls its drilling tool to complete the assigned work. Fault drilling generates a large amount of monitoring result data, which the test tools collect and aggregate in preparation for the subsequent analysis.
In the above embodiment, the method for updating parameters of the performance analysis model by using the monitoring result data includes:
inputting loads to a target system to implement fault drilling, and simultaneously acquiring monitoring result data of each test point through a test tool and sequentially reading the monitoring result data in time order; carrying out uniform data format conversion and storage on the sequentially read monitoring result data according to the event information structure model; judging whether a fault point exists in the current fault drilling result according to the fault threshold definition corresponding to each test point; inputting the load into the performance analysis model to obtain simulation result data corresponding one-to-one to the monitoring result data; and comparing the monitoring result data with the simulation result data and, when the difference exceeds the threshold range, updating the configuration parameters of the performance analysis model until the optimal performance analysis model is obtained.
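The reading and conversion steps can be illustrated with a small sketch. The `Event` fields, the two tool output formats and the test point names are assumptions made for this example (they are not the event information structure model defined in this document); the time-ordered reading from several sources uses the standard-library `heapq.merge`, which requires each input stream to be sorted already.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float      # seconds since the start of the drill
    test_point: str
    metric: str
    value: float


def from_latency_tool(row):
    # Hypothetical tool that emits (time_in_ms, latency_in_ms) tuples.
    return Event(row[0] / 1000.0, "tp_gateway", "latency_ms", float(row[1]))


def from_cpu_tool(row):
    # Hypothetical tool that emits dictionaries with "ts" and "util" keys.
    return Event(float(row["ts"]), "tp_srv1_cpu", "cpu_util", float(row["util"]))


def merged_events(*streams):
    """Lazily merge per-tool event streams, each already sorted by timestamp."""
    return heapq.merge(*streams, key=lambda e: e.timestamp)


latency_rows = [(1000, 12.0), (3000, 15.5)]
cpu_rows = [{"ts": 2.0, "util": 0.41}, {"ts": 4.0, "util": 0.93}]

events = list(merged_events(
    map(from_latency_tool, latency_rows),
    map(from_cpu_tool, cpu_rows),
))
```

Each adapter maps one tool's native format onto the shared structure, so the downstream threshold checks and model comparison only need to understand a single record type.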
In specific implementation, the monitoring result data obtained by fault drilling execution is time sequence data, that is, the system monitoring data collected at each time point in the fault drilling process are arranged and stored according to a time sequence. The process of updating the parameters of the performance analysis model according to the data obtained by the fault drilling execution is shown in fig. 6, and the process is explained as follows:
1. collecting time sequence data: the formats and storage modes of the performance monitoring data obtained by running different testing tools are different, but the performance monitoring data are time series data in nature. Therefore, the first step is to read the time sequence data from a plurality of sources (test points) in sequence according to the time sequence;
2. uniformly transforming and storing the time series data according to the event information structure model: through structure transformation, the acquired time series data can be mapped into data conforming to the event information structure model and stored uniformly in a big data storage system, for example a column-oriented database such as HBase or ClickHouse;
3. analyzing whether the test point has a fault: judging whether the result of the current fault drilling experiment exceeds a threshold value according to the threshold value definition corresponding to each test point, and if the result exceeds the threshold value, determining that the corresponding test point has a fault;
4. analyzing the core performance behavior characteristics of the system through a performance analysis model: and taking the monitoring result data of the fault drilling as input, and calculating to obtain core performance behavior characteristics, such as the maximum throughput and the end-to-end response duration which can be theoretically achieved by the system under the current input distribution and pressure. The theoretical value can be compared with the measured value. If the difference is large, the difference between the experimental environment and the theoretical model needs to be analyzed, the setting of the experimental environment is adjusted, or the structure or the parameters of the performance analysis model are adjusted until the difference between the experimental environment and the theoretical model reaches an acceptable degree;
5. adjusting parameters of the performance analysis model: if a fault point occurs in this round of the experiment, the configuration parameters of the performance analysis model can be adjusted in a targeted manner in light of the analysis of the core performance behavior characteristics. That is, the key factors causing the fault can be identified from the theoretical model, and the structure or parameters of the experimental system can be adjusted accordingly. For example, if the resource node where a fault occurs depends on another resource node, or competes with another resource node, the real bottleneck resource can be found through the performance model, and the parameters of the bottleneck resource can then be adjusted to eliminate the fault;
6. analyzing and predicting the performance behavior of the system: by combining the performance analysis model with the result data of the fault drilling execution, specific performance behavior values can be obtained, and with the help of big data the key behavior of the target system can be described and predicted more accurately, in particular the quantitative behavior at the fault points of interest. The prediction result serves as an important basis for adjusting the configuration and parameters of each link in the next round of the chaotic engineering experiment. The main process therefore starts a new round of chaotic engineering experiments: updating the relevant configuration, updating the fault drilling execution plan, updating the performance analysis model, and so on.
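Steps 3 to 5 above can be sketched in a deliberately simplified form. The example below assumes a single M/M/1 station, for which the mean response time is 1 / (mu - lambda), and refits the service rate mu from one measured response time when model and measurement disagree by more than a tolerance. The function names, the 5% tolerance and all numbers are assumptions for illustration; a real model would calibrate many stations against many observations.

```python
def find_fault_points(measurements, thresholds):
    """Step 3: test points whose measured value exceeds the configured threshold."""
    return sorted(tp for tp, value in measurements.items() if value > thresholds[tp])


def mm1_response_time(arrival_rate, service_rate):
    """Theoretical mean response time of an M/M/1 station, in seconds."""
    if arrival_rate >= service_rate:
        return float("inf")           # the station cannot keep up with the load
    return 1.0 / (service_rate - arrival_rate)


def calibrate_service_rate(arrival_rate, measured_rt, service_rate, tolerance=0.05):
    """Steps 4 and 5: keep the parameter if the model agrees, otherwise refit it."""
    simulated_rt = mm1_response_time(arrival_rate, service_rate)
    if abs(simulated_rt - measured_rt) <= tolerance * measured_rt:
        return service_rate           # model already within the accepted range
    # Invert T = 1 / (mu - lambda) to obtain the mu implied by the measurement.
    return arrival_rate + 1.0 / measured_rt


# Hypothetical drill results: latency thresholds in milliseconds per test point.
faults = find_fault_points(
    {"tp_gateway": 25.0, "tp_db": 8.0},
    {"tp_gateway": 20.0, "tp_db": 10.0},
)

# Nominal model predicts 1 / (100 - 40) s, but 0.025 s was measured at the gateway.
updated_rate = calibrate_service_rate(arrival_rate=40.0, measured_rt=0.025,
                                      service_rate=100.0)
```

With the refitted rate the model reproduces the measured response time at the drilled load, which is the sense in which the theoretical and measured values are brought to an acceptable difference.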
In the above embodiment, the method for simulating a fault point in an output resource map based on the updated performance analysis model includes:
and based on the optimal performance analysis model, simulating, analyzing and predicting potential fault points in the resource map when inputting new loads. Therefore, the potential fault point of the target system can be accurately simulated according to the corresponding load by adopting the optimal performance analysis model, so that an engineer is informed of taking corresponding measures in advance, and the running stability of the target system is ensured.
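A rough sketch of such a prediction is given below, assuming the calibrated model is a set of per-resource service rates and visit ratios and that a resource is flagged as a potential fault point when its predicted utilisation exceeds a limit. The 0.8 limit, the resource names and the rates are assumptions chosen for the example, not values from this document.

```python
def predict_fault_points(service_rates, visits, new_arrival_rate, max_utilisation=0.8):
    """Return resources whose predicted utilisation exceeds the limit under a new load.

    service_rates: calibrated requests/s per resource node
    visits:        mean visits per request for each node (defaults to 1.0)
    """
    at_risk = {}
    for node, rate in service_rates.items():
        utilisation = new_arrival_rate * visits.get(node, 1.0) / rate
        if utilisation > max_utilisation:
            at_risk[node] = round(utilisation, 3)
    return at_risk


# Hypothetical calibrated parameters attached to resource map nodes.
calibrated_rates = {"srv1_cpu": 80.0, "sw1": 400.0, "srv1_disk": 50.0}
visit_ratios = {"sw1": 2.0, "srv1_disk": 0.5}

risk_at_70 = predict_fault_points(calibrated_rates, visit_ratios, new_arrival_rate=70.0)
risk_at_90 = predict_fault_points(calibrated_rates, visit_ratios, new_arrival_rate=90.0)
```

Sweeping the new load in this way indicates which resource nodes are likely to become fault points first, so that measures can be planned before the load is actually applied.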
It should be emphasized that, in the present embodiment, the iteration of chaotic engineering does not simply repeat the fault drilling experiment; it combines the fault drilling process with the process of releasing software online. In general, the execution environment of the fault drilling may be the online production environment of the software, or a "rehearsal" environment used before going online (the so-called "staging environment"). In the former case, it must be ensured that the fault drilling does not affect formal services; in the latter case, the fault drilling has more freedom, but some fault points are harder to monitor. In addition, if resources allow, a separate fault drilling environment can also be deployed for independent experiments.
An important premise for fusing the core process of chaotic engineering with the software engineering process is the maturity of automated software release and continuous integration technology. Automated integration and release tools can also deploy software and bring it online in a fault drilling environment, which connects the traditional process with the chaotic engineering process, as shown in fig. 7. In fig. 7, a dotted line represents feedback to, and adjustment of, the original process based on analysis of the fault drilling results: for example, feeding a system problem directly back to the software development process, or feeding a request for system parameter adjustment back to the software rehearsal and software online processes, and possibly further back to the software development process to adjust the design and implementation of the system.
In a specific implementation, the most important application scenario of this embodiment is the upgrading of software engineering methods and processes. Generally, like software rehearsal and software go-live, fault drilling is handled by an operation and maintenance team or an implementation team, so adjustments to system configuration parameters are handled within that team's internal process, and an iterative upgrade of software research and development is triggered only when a deeper problem needs to be solved by the research and development team. In addition, the timing of the fault drilling can also be determined by the operation and maintenance team or the implementation team, that is, this team configures the time scheduling scheme for fault drilling execution, because it is responsible for the configuration and on-site implementation requirements of the software in the production environment.
Meanwhile, a software research and development team needs to add a mechanism and a process for optimizing and upgrading the system by quickly responding to the information fed back by the fault drilling. Unlike simple "field requirements and problem feedback," structured feedback information described by models and data is accurate and can efficiently guide a research and development team to locate and solve software problems.
In addition, the model-driven and big-data-driven chaotic engineering method of this embodiment can also be applied to upgrading automatic operation and maintenance technology. For systems that must serve a large number of users continuously, 24 hours a day and 7 days a week, fault drilling experiments are often conducted in the online production environment, because the enormous cost makes it difficult to build a second experimental environment that matches the online one. Such a system needs to be able to dynamically set aside a temporary set of service nodes for fault drilling, and this set of nodes must be returned to online service after the drill is completed. Even when no targeted fault drilling experiment is performed and only daily operation and maintenance monitoring and system optimization are carried out, the configuration, monitoring, data acquisition and analysis of the system throughout the whole process need to be supported by automated procedures. The solution of the invention can effectively support the realization of such operation and maintenance technology.
In summary, the present embodiment has the following beneficial effects:
1. iterative upgrading of the fault drilling experiment is driven through a quantitative analysis model and big data analysis and prediction, and the method is more targeted and directional, and can effectively predict scenes in which faults are likely to occur so as to realize early prevention;
2. the fault drilling process is integrated into the software research, development, test and online processes, the automation degree of the fault drilling process is improved, and the side effect of manual intervention is reduced;
3. the resource map can analyze and model scenes of a large-scale distributed system, can realize configuration, management and automatic scheduling of complex scenes, and effectively solves the problem of the complex scenes.
Example two
The embodiment provides a chaos engineering-based fault drilling device, which comprises:
the configuration unit is used for configuring a resource map corresponding to a target system structure and marking test points in the resource map;
the planning unit is used for making a fault drilling execution plan based on the resource map and the test points and constructing an automatic test program;
the fault drilling unit is used for running an automatic test program and executing fault drilling in a target system to obtain monitoring result data of the test point;
the model updating unit is used for constructing a performance analysis model according to the resource map and the test points and updating parameters of the performance analysis model by using the monitoring result data;
and the fault prediction unit is used for simulating and outputting fault points in the resource map based on the updated performance analysis model.
Preferably, the device further comprises:
the classification definition unit is used for classifying and defining the fault points and constructing an event information structure model when the fault points occur;
and the data conversion unit is used for describing fault point information by adopting a uniform data format based on the event information structure model when a fault event occurs in the test point.
Preferably, the device further comprises:
the load input unit is used for inputting a load to a target system to implement fault drilling, and simultaneously, the monitoring result data of each test point is collected through a test tool and is sequentially read according to the time sequence;
the data reading unit is used for carrying out uniform data format conversion and storage on the sequentially read monitoring result data according to the event information structure model;
the fault judging unit is used for judging whether a fault point exists in the current fault drilling result according to the fault threshold definition corresponding to each test point;
the simulation output unit is used for inputting the load into the performance analysis model to obtain simulation result data corresponding to the monitoring result data one by one;
and the model updating unit is used for updating the configuration parameters of the simulation result data when the difference exceeds a threshold value range by comparing the monitoring result data with the simulation result data until an optimal performance analysis model is obtained.
Compared with the prior art, the beneficial effects of the chaos engineering based fault drilling device provided by the embodiment of the invention are the same as the beneficial effects of the chaos engineering based fault drilling method provided by the first embodiment of the invention, and are not repeated herein.
EXAMPLE III
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the chaos engineering-based fault drilling method.
Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the embodiment are the same as the beneficial effects of the chaos engineering-based fault drilling method provided by the above technical scheme, and are not repeated herein.
It will be understood by those skilled in the art that all or part of the steps of the method of the invention may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the embodiment. The storage medium may be ROM/RAM, a magnetic disk, an optical disk, a memory card, or the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A fault drilling method based on chaotic engineering is characterized by comprising the following steps:
configuring a resource map corresponding to a target system structure, and marking test points in the resource map;
formulating a fault drilling execution plan based on the resource map and the test points, and constructing an automatic test program;
running an automatic test program, and executing fault drilling in a target system to obtain monitoring result data of the test point;
constructing a performance analysis model according to the resource map and the test points, and updating parameters of the performance analysis model by using the monitoring result data;
and simulating and outputting fault points in the resource map based on the updated performance analysis model.
2. The method of claim 1, wherein configuring the resource graph corresponding to the target system architecture further comprises:
classifying and defining fault points, and constructing an event information structure model when the fault points occur;
when a fault event occurs in a test point, describing fault point information by adopting a uniform data format based on the event information structure model.
3. The method according to claim 1 or 2, wherein the step of establishing a fault drilling execution plan based on the resource map and the test points comprises the following steps:
marking the test points for fault drilling and the correlation among the test points in the resource map;
instantiating the test points and configuring test tools corresponding to the test points to obtain at least one test task;
organizing the test tasks into a test flow according to a logical relation, and configuring scheduling time when the flow is executed;
and generating an execution plan of the fault drilling according to the test flow and the scheduling time, and constructing an automatic test program.
4. The method of claim 1 or 2, wherein the method of constructing a performance analysis model from the resource map and the test points comprises:
and training a performance analysis model by adopting a queuing network model according to the resource map and the test points.
5. The method of claim 4, wherein the method for updating parameters of the performance analysis model using the monitoring result data comprises:
inputting loads to a target system to implement fault drilling, and simultaneously acquiring monitoring result data of each test point through a test tool and sequentially reading the monitoring result data according to a time sequence;
carrying out uniform data format conversion and storage on the monitoring result data which are read in sequence according to an event information structure model;
judging whether a fault point exists in the current fault drilling result or not according to the fault threshold definition corresponding to each test point;
inputting the load into the performance analysis model to obtain simulation result data corresponding to the monitoring result data one to one;
and updating the configuration parameters of the simulation result data when the difference exceeds a threshold range by comparing the monitoring result data with the simulation result data until an optimal performance analysis model is obtained.
6. The method of claim 5, wherein simulating the output of the fault point in the resource map based on the updated performance analysis model comprises:
and based on the optimal performance analysis model, simulating, analyzing and predicting potential fault points in the resource map when inputting new loads.
7. A chaos engineering based fault drilling device is characterized by comprising:
the configuration unit is used for configuring a resource map corresponding to a target system structure and marking test points in the resource map;
the planning unit is used for making a fault drilling execution plan based on the resource map and the test points and constructing an automatic test program;
the fault drilling unit is used for running an automatic test program and executing fault drilling in a target system to obtain monitoring result data of the test point;
the model updating unit is used for constructing a performance analysis model according to the resource map and the test points and updating parameters of the performance analysis model by using the monitoring result data;
and the fault prediction unit is used for simulating and outputting fault points in the resource map based on the updated performance analysis model.
8. The apparatus of claim 7, further comprising:
the classification definition unit is used for classifying and defining the fault points and constructing an event information structure model when the fault points occur;
and the data conversion unit is used for describing fault point information by adopting a uniform data format based on the event information structure model when a fault event occurs in the test point.
9. The apparatus of claim 8, further comprising:
the load input unit is used for inputting a load to a target system to implement fault drilling, and simultaneously, the monitoring result data of each test point is collected through a test tool and is sequentially read according to the time sequence;
the data reading unit is used for carrying out uniform data format conversion and storage on the sequentially read monitoring result data according to the event information structure model;
the fault judging unit is used for judging whether a fault point exists in the current fault drilling result according to the fault threshold definition corresponding to each test point;
the simulation output unit is used for inputting the load into the performance analysis model to obtain simulation result data corresponding to the monitoring result data one by one;
and the model updating unit is used for updating the configuration parameters of the simulation result data when the difference exceeds a threshold value range by comparing the monitoring result data with the simulation result data until an optimal performance analysis model is obtained.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 6.
CN202110215213.6A 2021-02-25 2021-02-25 Fault drilling method and device based on chaotic engineering Pending CN113010393A (en)


Publications (1)

Publication Number Publication Date
CN113010393A true CN113010393A (en) 2021-06-22


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935178A (en) * 2021-10-21 2022-01-14 北京同创永益科技发展有限公司 Explosion radius control system and method for cloud-originated chaos engineering experiment
CN114113984A (en) * 2021-11-29 2022-03-01 平安壹账通云科技(深圳)有限公司 Fault drilling method, device, terminal equipment and medium based on chaotic engineering
CN114389849A (en) * 2021-12-17 2022-04-22 中电信数智科技有限公司 Disaster recovery drilling method and system for network security
CN114609995A (en) * 2022-03-04 2022-06-10 亚信科技(南京)有限公司 Fault control method, device, system, equipment, medium and product
CN115033415A (en) * 2022-06-21 2022-09-09 北京同创永益科技发展有限公司 Chaotic engineering fault evaluation method based on FMEA
CN115081653A (en) * 2022-07-27 2022-09-20 南京争锋信息科技有限公司 Multi-environment multi-architecture chaotic engineering full life cycle management and control method and system
CN115438518A (en) * 2022-11-08 2022-12-06 恒丰银行股份有限公司 Fault simulation application system based on chaos concept
CN116542000A (en) * 2023-05-05 2023-08-04 华能威海发电有限责任公司 Power grid refinement management system based on source network data analysis
CN116703144A (en) * 2023-08-02 2023-09-05 深圳市东微智能科技股份有限公司 Exercise information acquisition method, device, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271233A1 (en) * 2007-07-31 2009-10-29 Schlumberger Technology Corporation Valuing future information under uncertainty
CN110308969A (en) * 2019-06-26 2019-10-08 深圳前海微众银行股份有限公司 Failure drilling method, device, equipment and computer storage medium
US20200257280A1 (en) * 2019-02-12 2020-08-13 Siemens Aktiengesellschaft Method for Checking an Industrial Facility, Computer Program, Computer-Readable Medium and System
CN111831569A (en) * 2020-07-22 2020-10-27 平安普惠企业管理有限公司 Test method and device based on fault injection, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271233A1 (en) * 2007-07-31 2009-10-29 Schlumberger Technology Corporation Valuing future information under uncertainty
US20200257280A1 (en) * 2019-02-12 2020-08-13 Siemens Aktiengesellschaft Method for Checking an Industrial Facility, Computer Program, Computer-Readable Medium and System
CN110308969A (en) * 2019-06-26 2019-10-08 深圳前海微众银行股份有限公司 Failure drilling method, device, equipment and computer storage medium
CN111831569A (en) * 2020-07-22 2020-10-27 平安普惠企业管理有限公司 Test method and device based on fault injection, computer equipment and storage medium

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935178A (en) * 2021-10-21 2022-01-14 北京同创永益科技发展有限公司 Blast radius control system and method for cloud-native chaos engineering experiments
CN113935178B (en) * 2021-10-21 2022-09-16 北京同创永益科技发展有限公司 Blast radius control system and method for cloud-native chaos engineering experiments
CN114113984A (en) * 2021-11-29 2022-03-01 平安壹账通云科技(深圳)有限公司 Fault drilling method, device, terminal equipment and medium based on chaotic engineering
CN114389849A (en) * 2021-12-17 2022-04-22 中电信数智科技有限公司 Disaster recovery drilling method and system for network security
CN114389849B (en) * 2021-12-17 2024-04-16 中电信数智科技有限公司 Disaster recovery and backup exercise method and system for network security
CN114609995A (en) * 2022-03-04 2022-06-10 亚信科技(南京)有限公司 Fault control method, device, system, equipment, medium and product
CN115033415A (en) * 2022-06-21 2022-09-09 北京同创永益科技发展有限公司 Chaotic engineering fault evaluation method based on FMEA
CN115081653A (en) * 2022-07-27 2022-09-20 南京争锋信息科技有限公司 Multi-environment multi-architecture chaotic engineering full life cycle management and control method and system
CN115438518A (en) * 2022-11-08 2022-12-06 恒丰银行股份有限公司 Fault simulation application system based on chaos concept
CN116542000A (en) * 2023-05-05 2023-08-04 华能威海发电有限责任公司 Power grid refinement management system based on source network data analysis
CN116542000B (en) * 2023-05-05 2024-01-26 华能威海发电有限责任公司 Power grid refinement management system based on source network data analysis
CN116703144A (en) * 2023-08-02 2023-09-05 深圳市东微智能科技股份有限公司 Exercise information acquisition method, device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113010393A (en) Fault drilling method and device based on chaotic engineering
CN110309071B (en) Test code generation method and module, and test method and system
US8056046B2 (en) Integrated system-of-systems modeling environment and related methods
CN100412871C (en) System and method to generate domain knowledge for automated system management
EP2572294B1 (en) System and method for sql performance assurance services
Shahid et al. Towards Resilient Method: An exhaustive survey of fault tolerance methods in the cloud computing environment
Castiglione et al. Modeling performances of concurrent big data applications
De Gooijer et al. An industrial case study of performance and cost design space exploration
US11880271B2 (en) Automated methods and systems that facilitate root cause analysis of distributed-application operational problems and failures
US11880272B2 (en) Automated methods and systems that facilitate root-cause analysis of distributed-application operational problems and failures by generating noise-subtracted call-trace-classification rules
CN111159897B (en) Target optimization method and device based on system modeling application
JP2006048702A (en) Automatic configuration of transaction-based performance model
CN115169810A (en) Artificial intelligence system construction method and device for power grid regulation
Li et al. Microservice migration using strangler fig pattern: A case study on the green button system
Hariri et al. Hierarchical modeling of availability in distributed systems
Willnecker et al. Optimization of deployment topologies for distributed enterprise applications
EP4152715A1 (en) Method and apparatus for determining resource configuration of cloud service system
Mikov et al. Program tools and language for Network simulation and analysis
Dobre et al. New trends in large scale distributed systems simulation
Ulrich et al. Operator timing of task level primitives for use in computation-based human reliability analysis
Krawczuk et al. Anomaly detection in scientific workflows using end-to-end execution gantt charts and convolutional neural networks
Karami et al. Maintaining accurate web usage models using updates from activity diagrams
US20210286785A1 (en) Graph-based application performance optimization platform for cloud computing environment
Hein et al. Performance and dependability evaluation of scalable massively parallel computer systems with conjoint simulation
Kharchenko et al. Technology Oriented Assessment of Software Reliability: Big Data Based Search of Similar Programs.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination