CN115438518B - Fault simulation application system based on chaos concept - Google Patents

Publication number
CN115438518B
Authority
CN
China
Prior art keywords
drilling
event
management module
management
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211388032.4A
Other languages
Chinese (zh)
Other versions
CN115438518A (en)
Inventor
李祥胜
丁瑜
陈磊
李宗清
杨效伟
王纪良
宇中雷
鹿铁峰
张雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hengfeng Bank Co ltd
Original Assignee
Hengfeng Bank Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hengfeng Bank Co ltd
Priority to CN202211388032.4A
Publication of CN115438518A
Application granted
Publication of CN115438518B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/08Computing arrangements based on specific mathematical models using chaos models or non-linear system models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/20Administration of product repair or maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Evolutionary Computation (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Nonlinear Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention belongs to the technical field of fault simulation and provides a fault simulation application system based on a chaos concept, comprising an environment management module, an event management module, a drill management module, a system management module, and an assessment and evaluation module. The environment management module manages the users' development, test, and production environments; the event management module manages system state events and fault injection events that arise during system operation; the drill management module simulates production events by combining different fault types; the system management module manages drill members and assigns permissions; and the assessment and evaluation module evaluates and gives feedback on the handling process and outcome of simulated production events for different drill processes and modes. By rapidly injecting single-point or full-link faults into an application system, the invention addresses the difficulty of simulating and reproducing events.

Description

Fault simulation application system based on chaos concept
Technical Field
The invention relates to the technical field of fault simulation, in particular to a fault simulation application system based on a chaos concept.
Background
Operation and maintenance of technology systems is largely reactive: events are handled ad hoc as they occur. Such events are sudden, unpredictable, and difficult to locate and handle, so in practice handling them often takes a long time and carries considerable risk. The main causes are as follows: a. Differences in the skill of the handling personnel. Operation and maintenance personnel differ in experience and skill, so event handling is often affected by factors such as experience and operational proficiency, which introduces personnel risk. b. Some emergency manuals lack effective validation. A system emergency manual is the operating guide used when an online event occurs, but it has often not been effectively verified in a real environment before use, so relying on it carries operational risk and may affect event handling in unknown ways. c. Events are difficult to simulate and reproduce.
Occasional online events are difficult to reproduce, which hinders problem analysis, localization, and handling. Events are often not caused by a single factor, and when a risk event arises from multiple factors, reproducing, locating, and handling the problem becomes even more important; however, flexible means of event reproduction are currently lacking.
Disclosure of Invention
Occasional online events are difficult to reproduce, which hinders problem analysis, localization, and handling, and flexible means of reproducing events caused by multiple factors are currently lacking. To address this, and the reactive nature of traditional operation and maintenance work, the invention provides a fault simulation application system based on a chaos concept for locating and analyzing problems and reproducing faults.
The technical solution of the invention provides a fault simulation application system based on a chaos concept, comprising an environment management module, an event management module, a drill management module, and an assessment and evaluation module;
the environment management module is used for collecting environment information of the server side through a probe program, obtaining state information of the server side in real time through a periodic heartbeat detection mechanism, and judging the health state of the server side, wherein the probe program realizes environment deployment of the system by establishing a connection with the server side;
the event management module has atomic-level fault simulation capability and, based on the probe program deployed through the environment management module, manages the faults that the server side can simulate so as to generate events;
the drill management module is used for orchestrating, across multiple dimensions, the events generated by the event management module, generating faults through the probe program of the environment management module, triggering drill actions, and receiving operation screenshots taken during the drill;
and the assessment and evaluation module is used for assessing the result, based on the drill-process operation screenshots received by the drill management module, according to drill duration, result judgment, operation steps, and operational compliance, and for outputting an evaluation result.
As a further limitation of the technical solution of the invention, the environment management module comprises a machine management submodule, an application management submodule, and a deployment unit management submodule;
the machine management submodule is used for managing the machines on which user applications run, specifically: editing the application system, deployment unit, and remark information of a machine; disabling/enabling the machine's probe program; and performing multi-dimensional queries by application name, deployment unit, IP address, and state;
the application management submodule is used for managing a user's different applications, specifically viewing, modifying, and deleting the application name, application code, application description, and remarks;
the deployment unit management submodule is used for managing the different deployment units of an application, specifically viewing, modifying, and deleting the deployment unit name, code, description, and remarks.
As a further limitation of the technical solution of the invention, the event management module is further used for performing multi-dimensional event queries by event type, event state, event name, and event code, and for editing the event type, event name, event code, and event state.
As a further limitation of the technical solution of the invention, the drill management module comprises a drill library submodule, an in-group drill submodule, and a case library submodule;
the drill library submodule is used for creating drills, specifically for selecting the drill content, the drill personnel, and the drill system, and for editing the drill event after "start drill" is clicked on the server side;
the in-group drill submodule is used for selecting the drill content, selecting drill personnel from among the group members, and selecting the drill system, and for editing the drill event after "start drill" is clicked on the server side;
and the case library submodule is used for storing the events generated in the event management module and for archiving and editing problem faults to generate drill cases, forming a case library.
As a further limitation of the technical solution of the invention, the attribute information of a drill includes its serial number, the group it belongs to, the drill type, the drill duration, the creator, the creation time, the latest experiment state, the release state, the drill description, and the drill operation information.
As a further limitation of the technical solution of the invention, the in-group drill submodule comprises a random drill unit, an emergency drill unit, and an assessment and evaluation unit;
the random drill unit is used for randomly selecting from all drill cases in the case library, generating a drill task, triggering the drill, and receiving screenshots of the drill operation process;
the emergency drill unit is used for generating a drill task for a specified scenario, triggering the drill, and receiving screenshots of the drill operation process;
and the assessment and evaluation unit is used for assessing the screenshots taken during the drill and outputting an evaluation result.
As a further limitation of the technical solution of the invention, the system further comprises a system management module used for managing drill members, controlling permissions, managing groups, and controlling permissions on system menus.
As a further limitation of the technical solution of the invention, the system management module comprises a user management submodule, a permission management submodule, and a task group management submodule;
the user management submodule is used for displaying a list of all system users and supporting query and filtering by user name, real name, role, and state; and for editing, disabling/enabling, resetting the password of, adding, and deleting users;
the permission management submodule is used for displaying a list of all user roles, displaying all menus for which permissions can be set, and supporting assignment of the corresponding menu permissions by role type; and for editing, adding to, and deleting from the menu list;
the task group management submodule is used for displaying a list of all task groups created in the system and supporting query and filtering by task group name and task group state; and for viewing, editing, and disabling/enabling tasks.
As a further limitation of the technical solution of the invention, the assessment and evaluation module is specifically used for judging and scoring the assessment result according to the set scoring rules, based on the handling steps of the simulated production event and the handling screenshots.
As a further limitation of the technical solution of the invention, the application process of the system comprises the following steps:
on the server side, the environment management module confirms that the drill environment is available, and the event management module is used to select the events for this drill and create the drill scenario, the drill event being created in the case library of the drill management module; the server side sends an instruction to the probe program through an HTTP service, the probe program manages the life cycle of the fault according to the drill event, and creation of the drill scenario is then complete. The drill personnel log in to the drill system through SSH and carry out the drill, and screenshots of the drill process are uploaded to the drill management module on the server side; after the drill ends, the assessment and evaluation module on the server side performs the assessment according to the uploaded screenshots and the time taken, and outputs an evaluation result.
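The instruction that the server side sends to the probe program over HTTP can be pictured as a small serialized message. The following is a minimal sketch only: the source does not specify a wire format, so the use of JSON and the field names `fault_type`, `start_time`, and `end_time` are assumptions made for illustration.

```python
import json


def build_instruction(fault_type: str, start_time: float, end_time: float) -> bytes:
    """Serialize a drill event into an HTTP request body for the probe program.

    The JSON layout and field names are hypothetical, not taken from the source.
    """
    if end_time <= start_time:
        raise ValueError("fault end time must be later than fault generation time")
    payload = {
        "fault_type": fault_type,   # which fault the probe should produce
        "start_time": start_time,   # fault generation time
        "end_time": end_time,       # fault end time
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")
```

Validating the time window before sending keeps a malformed drill event from reaching a probe that holds elevated privileges on the drill system.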
From the above technical solution, the invention has the following advantages: by rapidly injecting single-point or full-link faults into an application system, it addresses the difficulty of simulating and reproducing events. The invention is mainly applied to enterprise-level banking application systems, where it simulates fault scenarios and improves system availability.
In addition, the invention has a reliable design principle, a simple structure, and very broad application prospects.
Therefore, compared with the prior art, the invention has prominent substantive features and represents notable progress, and the beneficial effects of its implementation are evident.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below; it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic block diagram of a system of one embodiment of the present invention.
FIG. 2 is a schematic block diagram of the environment management module of the system of one embodiment of the present invention.
FIG. 3 is a schematic block diagram of a drill management module of the system of one embodiment of the present invention.
FIG. 4 is a schematic block diagram of a system management module of one embodiment of the present invention.
Detailed Description
The invention aims to solve the problem of the reactive working method in traditional operation and maintenance work by providing an application system that flexibly reproduces faults and effectively verifies events in operation and maintenance work. In order that those skilled in the art may better understand the technical solution of the present invention, the technical solution in the embodiments of the present invention is described clearly and completely below with reference to the drawings of the embodiments; the described embodiments are evidently only some, and not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can obtain from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
As shown in FIG. 1, an embodiment of the present invention provides a fault simulation application system based on a chaos concept, comprising an environment management module, an event management module, a drill management module, and an assessment and evaluation module;
the environment management module is used for collecting environment information of the server side through a probe program, obtaining state information of the server side in real time through a periodic heartbeat detection mechanism, and judging the health state of the server side, wherein the probe program establishes contact with the server side to realize environment deployment of the system;
the environment management module is used for managing the system environment information relevant to the platform, including the system name, deployment units, deployment instances, and the like; environment information of the target server, such as IP address, server name, CPU, memory configuration, and disk space, is collected through the probe program; server state information is obtained in real time through a periodic heartbeat detection mechanism and used to judge the health state of the server; and the probe program establishes a management relationship with the server side, realizing the environment management function of the system and supporting the functions of the event management module and the drill management module.
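The heartbeat-based health judgment described above can be sketched as a classification on the age of the last heartbeat. This is an illustrative sketch under stated assumptions: the source does not give a heartbeat interval, a missed-beat threshold, or state names, so `interval`, `max_missed`, and the labels "healthy", "degraded", and "offline" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeStatus:
    """Last known state of one probe, as tracked on the server side."""
    host: str
    last_heartbeat: float  # timestamp in seconds, e.g. from time.time()


def health_state(status: ProbeStatus, now: float,
                 interval: float = 10.0, max_missed: int = 3) -> str:
    """Classify a host by how many heartbeat intervals have elapsed.

    interval and max_missed are illustrative values, not from the source.
    """
    missed = int((now - status.last_heartbeat) // interval)
    if missed < 1:
        return "healthy"
    if missed < max_missed:
        return "degraded"
    return "offline"
```

Counting missed intervals rather than comparing against a single timeout lets the server side distinguish a probe that is merely late from one that has stopped reporting.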
The event management module has atomic-level fault simulation capability and, based on the probe program deployed through the environment management module, manages the faults that the server side can simulate so as to generate events;
the event management module provides more than 200 common atomic-level fault simulation capabilities and, based on the probe program deployed on the target server through the environment management module, manages the faults that the server can simulate and makes them available for subsequent drills. Fault simulation is carried out by the probe program, which is deployed on the drill system and stays alive with the server side by sending heartbeats over HTTP. When installed, the probe program obtains the highest privileges on the drill system; it can modify the system kernel and change hardware drivers, and can thereby produce faults in network devices, CPU, memory, and disk that simulate real scenarios. Taking a network card outage as an example: the server side sends the probe program an instruction describing the network card fault event (including the fault generation time, fault end time, and fault type); on receiving the instruction, the probe program isolates the network card driver of the drill system, so that the network card goes offline and the drill system is isolated from other systems; when the fault end time is reached, the probe program removes the driver isolation, and the fault is cleared.
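The fault life cycle in the network card example (inject at the fault generation time, recover at the fault end time) can be sketched as a small state tracker. This is a sketch only: the class and field names and the `"nic_down"` label are hypothetical, and the actual driver isolation is replaced by log entries so the example stays harmless to run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultEvent:
    fault_type: str    # hypothetical label, e.g. "nic_down"
    start_time: float  # fault generation time
    end_time: float    # fault end time


@dataclass
class ProbeLifecycle:
    """Tracks whether the fault described by an event is currently active."""
    event: FaultEvent
    active: bool = False
    log: List[str] = field(default_factory=list)

    def tick(self, now: float) -> None:
        """Inject or recover the fault depending on the current time."""
        should_be_active = self.event.start_time <= now < self.event.end_time
        if should_be_active and not self.active:
            self.log.append(f"inject {self.event.fault_type}")   # e.g. isolate the driver
            self.active = True
        elif not should_be_active and self.active:
            self.log.append(f"recover {self.event.fault_type}")  # e.g. remove the isolation
            self.active = False
```

Calling `tick` periodically with the current time makes injection and recovery idempotent: repeated calls inside the fault window do not inject the fault twice.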
The drill management module is used for orchestrating, across multiple dimensions, the events generated by the event management module, generating faults through the probe program of the environment management module, triggering drill actions, and receiving operation screenshots taken during the drill;
and the assessment and evaluation module is used for assessing the result, based on the drill-process operation screenshots received by the drill management module, according to drill duration, result judgment, operation steps, and operational compliance, and for outputting an evaluation result.
The system mainly simulates environment faults, assists in analyzing and locating system problems and in determining and implementing solutions, and helps improve the capability of operation and maintenance personnel. The drill management module orchestrates the event scenarios generated by the event management module based on the system information in the environment management module, carries out the simulation through drills, and produces assessment results through the assessment and evaluation module.
As shown in FIG. 2, the environment management module comprises a machine management submodule, an application management submodule, and a deployment unit management submodule;
the machine management submodule is used for managing the machines on which user applications run, specifically: editing the application system, deployment unit, and remark information of a machine; disabling/enabling the machine's probe program; and performing multi-dimensional queries by application name, deployment unit, IP address, and state;
the application management submodule is used for managing a user's different applications, specifically viewing, modifying, and deleting the application name, application code, application description, and remarks;
the deployment unit management submodule is used for managing the different deployment units of an application, specifically viewing, modifying, and deleting the deployment unit name, code, description, and remarks.
The machine management submodule provides machine query, addition, viewing, editing, enabling, and disabling functions and is responsible for managing the machines on which user applications run; the application management submodule provides application query, addition, viewing, editing, and deletion functions and is responsible for managing a user's different applications; an application is often divided into different deployment units according to the services it provides, such as an online deployment unit, a batch deployment unit, and a database deployment unit, and the deployment unit management submodule is mainly responsible for managing the different deployment units of an application.
The event management module is further used for performing multi-dimensional event queries by event type, event state, event name, and event code, and for editing the event type, event name, event code, and event state.
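A multi-dimensional event query of this kind amounts to filtering on whichever criteria are supplied. The sketch below is one possible reading, not the patented implementation: the `Event` record, the exact-match rule for type, state, and code, and the case-insensitive substring match on the name are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Event:
    code: str
    name: str
    event_type: str
    state: str


def query_events(events: Iterable[Event], *,
                 event_type: Optional[str] = None,
                 state: Optional[str] = None,
                 name: Optional[str] = None,
                 code: Optional[str] = None) -> List[Event]:
    """Return the events that satisfy every criterion that was supplied."""
    result = []
    for e in events:
        if event_type is not None and e.event_type != event_type:
            continue
        if state is not None and e.state != state:
            continue
        if code is not None and e.code != code:
            continue
        # Name matching is a case-insensitive substring test (an assumption).
        if name is not None and name.lower() not in e.name.lower():
            continue
        result.append(e)
    return result
```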
The current application of distributed technology makes infrastructure more complex than in traditional setups; the financial industry in particular involves a large volume of fund transactions and spans complex infrastructure such as multiple data centers, multi-active deployment, disaster recovery, containers, and virtual machines. Reproducing and handling production events often requires coordination among specialists in different departments, which consumes time and resources. The system can realize 218 basic faults covering tens of specialties, such as K8S, CPU, disk, network, DNS, shared storage, memory, IO, JVM, message queue, cache, and database. Each fault type in turn includes typical faults such as disk read/write faults, disk filling, process killing, CPU saturation, network delay, and network packet loss. The system interacts with the probe program installed on the target machine and, by sending different instructions, injects a specified fault into the target machine with one click. For example, a process-kill fault is produced by sending a KILL command to the target machine; a network delay fault is produced by sending a TC command to the target machine; and a JVM-type fault injects a user-defined Java exception type into the target machine to reproduce memory exceptions and the like. These common faults can be injected with one click, without building infrastructure or coordinating several departments to reproduce them jointly. In addition, events in complex or full-link transaction scenarios can be simulated by injecting custom exceptions or delayed execution into program methods and by orchestrating the basic fault capabilities.
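The mapping from a fault type to the command sent to the target machine might look like the following sketch. The source names only "a KILL command" and "a TC command"; the specific forms shown here (`kill -9` and a `tc ... netem delay` rule), the fault-type labels, and the parameter names are assumptions. The function only builds argument lists and does not execute anything.

```python
from typing import List


def build_fault_command(fault_type: str, **params) -> List[str]:
    """Translate a fault type into the argv a probe would run (never executed here)."""
    if fault_type == "kill_process":
        # Forcefully terminate the process with the given PID.
        return ["kill", "-9", str(params["pid"])]
    if fault_type == "network_delay":
        # Add a fixed delay to all egress traffic on one interface via netem.
        return ["tc", "qdisc", "add", "dev", params["device"],
                "root", "netem", "delay", f"{params['delay_ms']}ms"]
    raise ValueError(f"unsupported fault type: {fault_type}")
```

For instance, `build_fault_command("network_delay", device="eth0", delay_ms=200)` yields the argument list for `tc qdisc add dev eth0 root netem delay 200ms`; recovery would use the corresponding `tc qdisc del` rule.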
Out-of-order timing in bank transactions is a classic business scenario: when a financial transaction has been sent but does not fully succeed because of a network timeout or another reason, the original transaction must be corrected to prevent the situation in which the user's transfer appears successful but the counterparty never receives the funds. Reproducing such a scenario requires the ability to orchestrate multiple faults. The steps in the system are as follows: (1) inject a JVM-class delay exception into the initiating transfer transaction; (2) inject a JVM custom exception into the correction transaction; (3) recover the fault of step (2); (4) inject a JVM custom exception into the transfer transaction; (5) recover the fault of step (2). In this way, with various orchestrations, code-level faults can be injected at a finer granularity to reproduce complex business scenarios.
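Orchestrating several faults in sequence can be sketched as an ordered plan of inject and recover steps applied to a set of active faults. This is a simplified illustration, not the patented orchestration: the `Step` type, the fault labels, and the rule that only an active fault may be recovered are assumptions.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Step:
    action: str  # "inject" or "recover"
    fault: str   # hypothetical fault label


def run_plan(plan: List[Step]) -> Set[str]:
    """Apply the steps in order and return the faults still active at the end."""
    active: Set[str] = set()
    for step in plan:
        if step.action == "inject":
            active.add(step.fault)
        elif step.action == "recover":
            if step.fault not in active:
                raise ValueError(f"cannot recover inactive fault: {step.fault}")
            active.remove(step.fault)
        else:
            raise ValueError(f"unknown action: {step.action}")
    return active
```

A shortened version of the transfer scenario, such as injecting `"transfer_delay"`, injecting `"correction_exception"`, and then recovering `"correction_exception"`, leaves only the delay fault active, which is the state in which the timing anomaly can be observed.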
As shown in FIG. 3, the drill management module comprises a drill library submodule, an in-group drill submodule, and a case library submodule;
the drill library submodule is used for creating drills, specifically for selecting the drill content, the drill personnel, and the drill system, and for editing the drill event after "start drill" is clicked on the server side;
the in-group drill submodule is used for selecting the drill content, selecting drill personnel from among the group members, and selecting the drill system, and for editing the drill event after "start drill" is clicked on the server side;
and the case library submodule is used for storing the events generated in the event management module and for archiving and editing problem faults to generate drill cases, forming a case library. Cases in the case library can be viewed, edited, and deleted, and a drill can be generated quickly from a case. The case library submodule is generally used after large-scale drills to summarize, archive, and edit commonly recurring problem faults, forming a case library. Events generated in the event management module are stored in the case library for later viewing and rapid drills. Carrying out a rapid drill only requires selecting the personnel and the drill system environment.
The drill library submodule can create rapid drills, create full-link drills, reset in-group drills, and reset random drills, and displays for every drill its serial number, the group it belongs to, the drill type, the drill duration, the creator, the creation time, the latest experiment state, the release state, the drill description, and operation information. The list can be filtered by different conditions, and drills can be viewed, edited, deleted, released, started, and have their results viewed. The in-group drill submodule is provided for simulated drills targeting specific applications and specific groups of people, and mainly comprises three sub-functions: random drill, emergency drill, and assessment and evaluation. The case library submodule can quickly form a fault close to a real scenario by combining basic faults; it includes functions for adding, modifying, and deleting event cases, as well as rapid drill and full-link drill functions.
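Generating a random drill from the case library reduces to drawing one case and attaching the selected personnel and drill system. The sketch below is illustrative only: the task layout and field names are hypothetical, and cases are represented as plain strings for brevity.

```python
import random
from typing import Dict, List


def make_random_drill(case_library: List[str], personnel: List[str],
                      drill_system: str, rng: random.Random) -> Dict[str, object]:
    """Draw one case at random and wrap it in a drill task record."""
    if not case_library:
        raise ValueError("case library is empty")
    return {
        "case": rng.choice(case_library),  # randomly extracted drill case
        "personnel": list(personnel),      # selected drill personnel
        "system": drill_system,            # selected drill system environment
        "screenshots": [],                 # filled in as the drill proceeds
    }
```

Passing the random generator in explicitly makes a drill reproducible when a seeded `random.Random` is supplied, which can be useful when reviewing a past drill.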
As shown in FIG. 4, the system management module comprises a user management submodule, a permission management submodule, and a task group management submodule;
the user management submodule is used for displaying a list of all system users and supporting query and filtering by user name, real name, role, and state;
the permission management submodule is used for displaying a list of all user roles, displaying all menus for which permissions can be set, and supporting assignment of the corresponding menu permissions by role type;
and the task group management submodule is used for displaying a list of all task groups created in the system and supporting query and filtering by task group name and task group state.
The system management module mainly manages drill members, assigns permissions, and the like. The user management submodule displays a list of all system users and supports query and filtering by user name, real name, role, and state; users can be added, edited, assigned roles, enabled, disabled, and deleted, and their passwords can be reset. The permission management submodule displays a list of all user roles, displays all menus for which permissions can be set, and supports assignment of the corresponding menu permissions by role type. The task group management submodule displays a list of all task groups created in the system and supports query and filtering by task group name and task group state; task groups can be added, edited, viewed, disabled, and so on.
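Assigning menu permissions by role can be sketched as a mapping from role to a set of permitted menus. The role and menu names below are hypothetical, and the source does not describe how permissions are stored; this is only one simple way to model it.

```python
from typing import Dict, Iterable, Set


def assign_menus(role_menus: Dict[str, Set[str]], role: str,
                 menus: Iterable[str]) -> None:
    """Grant a role access to the given menus, creating the role if needed."""
    role_menus.setdefault(role, set()).update(menus)


def can_access(role_menus: Dict[str, Set[str]], role: str, menu: str) -> bool:
    """Return True only if the role has been granted the menu."""
    return menu in role_menus.get(role, set())
```

Unknown roles default to an empty permission set, so access is denied unless it has been granted explicitly.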
And the assessment evaluation module is specifically used for judging and scoring the assessment result and evaluating the related treatment cases on the basis of the treatment steps of the simulation production events and the treatment screenshot photos and on the basis of the related scoring rules.
The assessment evaluation module gives fair evaluation and feedback on the handling process and final outcome of a simulated production event, across the different drilling processes and modes. Based on the handling steps of the simulated production event and the handling screenshots, the module judges and scores the assessment result according to the relevant scoring rules; it can also evaluate the related handling cases and associate them with the relevant knowledge bases. The assessment and evaluation team can enter the associated knowledge base, summarize and consolidate common production events arising during operation and maintenance, and link them to the relevant knowledge base. After a drill is finished, the handling personnel can open the relevant knowledge base to consult the standard emergency handling method and cases for further study and review. This forms a closed loop of production event simulation, emergency handling drilling, drilling result evaluation, and review, and provides a positive incentive for daily operation and maintenance work.
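The scoring rules themselves are not disclosed. Purely as an assumed example, a rule set that combines drilling duration, result judgment, evidenced handling steps, and compliance could look like the sketch below. The point weights, the linear overtime penalty, and all names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical point weights; the actual scoring rules are not given in the text.
MAX_DURATION_POINTS = 30
MAX_RESULT_POINTS = 30
MAX_STEP_POINTS = 30
MAX_COMPLIANCE_POINTS = 10


@dataclass
class DrillRecord:
    allotted_s: int        # time allowed for the drill, in seconds
    elapsed_s: int         # time actually consumed
    fault_repaired: bool   # result judgment
    steps_expected: int    # key handling steps defined for the case
    steps_evidenced: int   # key steps backed by an uploaded screenshot
    compliant: bool        # whether handling was judged compliant


def score(record: DrillRecord) -> float:
    """Return a score between 0 and 100 under the assumed rules above."""
    if record.allotted_s <= 0 or record.steps_expected <= 0:
        raise ValueError("allotted_s and steps_expected must be positive")
    # Full duration points within the allotted time, falling linearly
    # to zero at twice the allotted time (an assumed penalty curve).
    overtime = max(0, record.elapsed_s - record.allotted_s)
    duration_factor = max(0.0, 1.0 - overtime / record.allotted_s)
    step_factor = min(1.0, record.steps_evidenced / record.steps_expected)
    total = (
        MAX_DURATION_POINTS * duration_factor
        + (MAX_RESULT_POINTS if record.fault_repaired else 0)
        + MAX_STEP_POINTS * step_factor
        + (MAX_COMPLIANCE_POINTS if record.compliant else 0)
    )
    return round(total, 1)
```

For instance, under these assumed weights a drill that runs 900 seconds against an allotted 600, repairs the fault, evidences 3 of 4 key steps, and is compliant would receive 15 + 30 + 22.5 + 10 = 77.5 points.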
The drilling management module further comprises a compliance management submodule for judging whether operations performed in the drilling management module are compliant. The environment management module likewise comprises a compliance management submodule for compliance management of application deployment. The submodule mainly provides compliance test questions to check whether the user's process for handling production problems is compliant. Compliance test questions covering regulatory requirements, internal rules, event handling, and similar aspects are entered into the system in advance; before handling a simulated event, the user must first answer these questions and only then handle the specific event.
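The requirement that the compliance questions be answered before a simulated event is handled can be sketched as a simple gate. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and it is assumed that answering every question unlocks handling while correctness is only counted, since the description does not state a pass mark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ComplianceQuestion:
    question_id: str
    text: str
    correct_option: str


class ComplianceGate:
    """Blocks event handling until the compliance questions are answered."""

    def __init__(self, questions: List[ComplianceQuestion]) -> None:
        self._questions = {q.question_id: q for q in questions}
        self._answers: Dict[str, str] = {}

    def answer(self, question_id: str, option: str) -> None:
        if question_id not in self._questions:
            raise KeyError(f"unknown question: {question_id}")
        self._answers[question_id] = option

    def all_answered(self) -> bool:
        return set(self._answers) == set(self._questions)

    def correct_count(self) -> int:
        return sum(
            1
            for qid, option in self._answers.items()
            if self._questions[qid].correct_option == option
        )

    def may_handle_event(self) -> bool:
        # Assumption: answering every question unlocks handling; the number
        # of correct answers is only reported, not used as a pass mark.
        return self.all_answered()


# Usage with two illustrative questions.
gate = ComplianceGate([
    ComplianceQuestion("q1", "Must a change be approved before execution?", "yes"),
    ComplianceQuestion("q2", "May production passwords be shared?", "no"),
])
```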
The drilling implementation process comprises three parts: drilling scene creation, drilling task implementation, and assessment.
To create the drilling scene, the server side first confirms through the environment management module that the drilling environment is available (the probe program sends heartbeat packets periodically, and the environment is judged available if a heartbeat packet has been received within the last 5 minutes). The event management module of the server side then selects the events to be drilled, such as a CPU abnormality or a network card abnormality; for random drilling no selection is needed, because the events are chosen at random. Drilling events are created in the case base of the drilling management module, where the case name, duration, fault times, fault system, ip address, and specific fault type must be edited; the fault types are those generated by the event management module of claim 3. Once the drilling event has been created, the server sends an instruction to the probe program through the http service, and the probe program manages the life cycle of the fault according to the drilling event. For example, to drill network packet loss on a drilling machine at a specific time, a network packet loss command is run on that machine, which completes the creation of the drilling scene.
To implement the drilling task, the drilling personnel log in to the drilling system through ssh. The main work is troubleshooting and repairing the fault, and troubleshooting relies on a third-party monitoring system. For example, a machine restart fault may appear as a loss of machine monitoring information; the drilling personnel must then log in to the drilling system machine through ssh and execute the relevant program start command. After technical verification passes, screenshots of the key steps in the drilling process are uploaded to the server side.
After the drilling is finished, the server side performs the assessment according to the uploaded screenshots and the time consumed in the drilling process.
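The availability check and event selection described above can be illustrated by the following minimal sketch. The 5-minute window is taken from the description; the function names, the event identifiers, and the treatment of the exact 5-minute boundary are assumptions.

```python
import random
from datetime import datetime, timedelta
from typing import Optional, Sequence

HEARTBEAT_WINDOW = timedelta(minutes=5)  # window stated in the description


def is_environment_available(
    last_heartbeat: Optional[datetime], now: datetime
) -> bool:
    """Judge the drilling environment available if a heartbeat is recent."""
    if last_heartbeat is None:
        return False  # the probe has never reported
    age = now - last_heartbeat
    # Assumption: a heartbeat exactly 5 minutes old still counts, and a
    # timestamp in the future is rejected as invalid.
    return timedelta(0) <= age <= HEARTBEAT_WINDOW


def select_event(
    events: Sequence[str],
    chosen: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the chosen event, or a random one for random drilling."""
    if not events:
        raise ValueError("no events available")
    if chosen is not None:
        if chosen not in events:
            raise ValueError(f"unknown event: {chosen}")
        return chosen
    # Random drilling: no selection is made, so one event is drawn at random.
    return (rng or random).choice(events)
```

Passing the current time and the random generator as arguments, rather than reading them inside the functions, keeps both checks deterministic under test.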
Although the present invention has been described in detail with reference to the drawings and in connection with the preferred embodiments, the present invention is not limited thereto. Those skilled in the art can make various equivalent modifications or substitutions to the embodiments of the present invention without departing from its spirit and scope, and such modifications or substitutions shall fall within the scope of the present invention; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall likewise fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A fault simulation application system based on a chaos concept is characterized by comprising an environment management module, an event management module, a drill management module and an examination and evaluation module;
the environment management module is used for collecting environment information of the server side through the probe program, acquiring state information of the server side in real time through a timing heartbeat detection mechanism and judging the health state of the server side, and the probe program is used for realizing environment deployment of the system through establishing connection with the server side;
the event management module has an atomic-level fault simulation capability, and manages the fault which can be simulated by the server to generate an event through the environment deployment of the server probe program based on the environment management module;
the drilling management module is used for performing multi-dimensional arrangement based on the event generated by the event management module, generating faults through a probe program in the environment management module, triggering drilling actions and receiving an operation screenshot in the drilling process;
the assessment evaluation module is used for assessing and evaluating the result according to the drilling time length, result judgment, operation steps and compliance operation and outputting an evaluation result based on the operation screenshot of the drilling process received by the drilling management module;
confirming that the drilling environment is in an available state through an environment management module at a server, and selecting an event needing drilling at this time to create a drilling scene through an event management module at the server, wherein the drilling event is created in a drilling management case library; the server sends an instruction to the probe program through http service, the probe program performs fault life cycle management according to the drilling event, and the drilling scene is created; and after the drilling personnel log in the drilling system through ssh, drilling implementation is carried out, the screenshots of the drilling process are uploaded to the drilling management module of the server, and after drilling is finished, the server-side assessment and evaluation module carries out assessment and evaluation according to the uploaded screenshots of the drilling process and the consumed time to output an evaluation result.
2. The chaos concept-based fault simulation application system of claim 1, wherein the environment management module comprises a machine management submodule, an application management submodule, and a deployment unit management submodule;
the machine management submodule is used for managing the machine where the user application is located, and specifically editing the application system, the deployment unit and the remark information where the machine is located; disabling/enabling the machine probe program; performing multi-dimensional query management through the application name, the deployment unit, the ip address and the state;
the application management submodule is used for managing different applications of a user, and specifically comprises the steps of checking, modifying and deleting an application name, an application code, an application description and a remark;
the deployment unit management submodule is used for managing different deployment units of a certain application; and viewing, modifying and deleting the name, the code, the description and the remark of the deployment unit.
3. The chaos concept-based fault simulation application system of claim 2, wherein the event management module is further configured to perform a multi-dimensional event query by using an event type, an event state, an event name, and an event code; and editing the event type, the event name, the event code and the event state.
4. The chaos concept-based fault simulation application system of claim 3, wherein the drilling management module comprises a drilling library sub-module, an intra-group drilling sub-module, and a case library sub-module;
the drilling library submodule is used for creating drilling, and is specifically used for selecting drilling content, selecting drilling personnel and selecting a drilling system to edit a drilling event after the service endpoint clicks to start drilling;
the in-group drilling sub-module is used for selecting drilling content, selecting drilling personnel in the group member and selecting a drilling system to edit drilling events after the service endpoint clicks to start drilling;
and the case base submodule is used for storing the events generated in the event management module, filing and editing the problem faults to generate a drilling case and form a case base.
5. The chaos concept-based fault simulation application system of claim 4, wherein the attribute information of the drills includes the serial numbers of all drills, the groups to which they belong, drill types, drill durations, creators, creation times, recent experiment states, release states, drill descriptions, and drill operation information.
6. The chaos concept-based fault simulation application system of claim 5, wherein the intra-group drilling submodule comprises a stochastic drilling unit, an emergency drilling unit and an assessment evaluation unit;
the random drilling unit is used for randomly extracting based on all drilling cases in the case base, generating drilling tasks, triggering to implement drilling and receiving screenshots of the drilling operation process;
the emergency drilling unit is used for generating a drilling task aiming at a set specific scene, triggering to perform drilling and receiving a screenshot of a drilling operation process;
and the examination evaluation unit is used for performing examination evaluation on the screenshot in the drilling process and outputting an evaluation result.
7. The chaos concept-based fault simulation application system of claim 6, further comprising a system management module for performing member management, authority control, group management, and authority control of system menus.
8. The chaos concept-based fault simulation application system of claim 7, wherein the system management module comprises a user management submodule, an authority management submodule and a task group management submodule;
the user management submodule is used for displaying all system user information lists and supporting query and screening according to user names, real names, affiliated roles and states; editing, forbidding/enabling, resetting passwords, adding and deleting the user;
the authority management submodule is used for displaying a list of all user roles, displaying all menus capable of setting the authority and supporting the distribution of corresponding menu authority according to the role types; editing, adding and deleting the menu list;
the task group management submodule is used for displaying a task group information list which is created by all the systems and supporting the inquiry and screening according to the task group name and the task group state; the tasks are viewed, edited, disabled/enabled.
9. The chaos concept-based fault simulation application system of claim 8, wherein the assessment evaluation module is specifically configured to determine and score the assessment result based on the handling steps and the handling screenshot photos of the simulated production events and the set scoring rules.
CN202211388032.4A 2022-11-08 2022-11-08 Fault simulation application system based on chaos concept Active CN115438518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211388032.4A CN115438518B (en) 2022-11-08 2022-11-08 Fault simulation application system based on chaos concept

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211388032.4A CN115438518B (en) 2022-11-08 2022-11-08 Fault simulation application system based on chaos concept

Publications (2)

Publication Number Publication Date
CN115438518A CN115438518A (en) 2022-12-06
CN115438518B true CN115438518B (en) 2023-04-07

Family

ID=84253170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211388032.4A Active CN115438518B (en) 2022-11-08 2022-11-08 Fault simulation application system based on chaos concept

Country Status (1)

Country Link
CN (1) CN115438518B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6018732A (en) * 1998-12-22 2000-01-25 Ac Properties B.V. System, method and article of manufacture for a runtime program regression analysis tool for a simulation engine
CN112464497A (en) * 2020-12-16 2021-03-09 江苏满运物流信息有限公司 Fault drilling method, device, equipment and medium based on distributed system
CN113535532A (en) * 2020-04-14 2021-10-22 中国移动通信集团浙江有限公司 Fault injection system, method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143130B (en) * 2014-07-28 2017-07-14 中国安全生产科学研究院 Accident emergency drilling system and drilling method
US10672289B2 (en) * 2015-09-24 2020-06-02 Circadence Corporation System for dynamically provisioning cyber training environments
CN111786823A (en) * 2020-06-19 2020-10-16 中国工商银行股份有限公司 Fault simulation method and device based on distributed service
CN112714013B (en) * 2020-12-22 2023-02-03 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN113010393A (en) * 2021-02-25 2021-06-22 北京四达时代软件技术股份有限公司 Fault drilling method and device based on chaotic engineering
CN113935178B (en) * 2021-10-21 2022-09-16 北京同创永益科技发展有限公司 Explosion radius control system and method for cloud-originated chaos engineering experiment
CN114647489A (en) * 2022-04-02 2022-06-21 阿里巴巴(中国)有限公司 Drill method and system applied to chaotic engineering
CN114791846B (en) * 2022-05-23 2022-10-04 北京同创永益科技发展有限公司 Method for realizing observability aiming at cloud-originated chaos engineering experiment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Fully Decentralized Multi-Agent Fault Location and Isolation for Distribution Networks With DGs;Wenguo Li等;《IEEE Access》;20210209;全文 *
分布式实时系统的软件故障注入;徐光侠等;《重庆大学学报》;20100215(第02期);全文 *

Also Published As

Publication number Publication date
CN115438518A (en) 2022-12-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant