CN112631850A - Fault scene simulation method and device - Google Patents
- Publication number
- CN112631850A CN202011538845.8A
- Authority
- CN
- China
- Prior art keywords
- fault
- information
- resource information
- system resource
- monitoring data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/26—Functional testing
- G06F11/261—Functional testing by simulating additional hardware, e.g. fault simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a method and a device for simulating a fault scene, which relate to the field of artificial intelligence. The method comprises: acquiring monitoring data of a service system, wherein the monitoring data comprises application service information and system resource information; identifying abnormal time periods in the acquired application service information and system resource information according to a preset identification model, wherein the identification model is constructed based on historical monitoring data, and the historical monitoring data comprises historical application service information and historical system resource information; generating a fault scene according to the system resource information in the identified abnormal time period, wherein the fault scene comprises at least one anomaly indicator derived from the system resource information in the abnormal time period; and performing a fault scene simulation operation on a simulation service system according to the fault scene. The method and the device allow an actual fault scene to be simulated accurately.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method and a device for simulating a fault scene.
Background
With the popularization of the internet and the rapid growth in internet users, the traditional monolithic application can no longer meet the system capacity and high-availability requirements imposed by increasing user load. Converting a monolithic application into distributed services effectively relieves this growing system pressure, as has been verified in the sales promotions of many internet companies. Compared with a traditional single-host application, however, a distributed application architecture and its infrastructure are far more complex: errors can occur anywhere in the system, and unpredictable emergencies cannot be avoided. To reduce such problems, they should be exposed more often, so that they can be found and solved and the fault tolerance of the system improved.
Chaos fault drilling is a technology used in such scenarios. It uses experiments to detect system risks in advance and addresses them by optimizing the architecture and improving operation and maintenance practices, so that a highly available, highly resilient distributed architecture is actually achieved and the risk of enterprise loss is reduced.
However, the fault drilling tools commonly used in the industry provide only a basic fault simulation mechanism: they simulate a single system index at a time, lack the ability to simulate an actual scenario, and simulate indexes with poor accuracy.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for simulating a fault scenario to solve at least one of the above-mentioned problems.
According to a first aspect of the present invention, there is provided a method for simulating a fault scenario, the method comprising:
acquiring monitoring data of a service system, wherein the monitoring data comprises: application service information and system resource information;
identifying abnormal time periods in the obtained application service information and system resource information according to a preset identification model, wherein the identification model is constructed based on historical monitoring data, and the historical monitoring data comprises: historical application service information and historical system resource information;
generating a fault scene according to the system resource information in the identified abnormal time period, wherein the fault scene comprises: at least one anomaly indicator derived from system resource information in the anomaly time period;
and performing fault scene simulation operation on the simulation service system according to the fault scene.
According to a second aspect of the present invention, there is provided a fault scenario simulation apparatus, the apparatus comprising:
an information monitoring unit, configured to obtain monitoring data of a service system, where the monitoring data includes: application service information and system resource information;
an exception identifying unit, configured to identify an exception time period in the acquired application service information and system resource information according to a preset identification model, where the identification model is constructed based on historical monitoring data, and the historical monitoring data includes: historical application service information and historical system resource information;
a fault scenario generation unit, configured to generate a fault scenario according to the system resource information in the identified abnormal time period, where the fault scenario includes: at least one anomaly indicator derived from system resource information in the anomaly time period;
and the fault scene simulation unit is used for carrying out fault scene simulation operation on the simulation service system according to the fault scene.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when executing the program.
According to a fourth aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the technical scheme, the acquired application service information and system resource information are analyzed according to the identification model to identify the abnormal time period; a fault scene is then generated according to the system resource information in the identified abnormal time period, and the fault scene simulation operation is performed on the simulation service system according to the fault scene. Because the fault scene combines multiple anomaly indicators, the actual fault scene can be simulated accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow diagram of a method of simulating a fault scenario according to an embodiment of the invention;
FIG. 2 is a block diagram of a simulation apparatus for fault scenarios according to an embodiment of the present invention;
FIG. 3 is a block diagram of an AI-based fault drilling device system using Linux kernel process group control according to an embodiment of the present invention;
FIG. 4 is a block diagram of the structure of the service monitoring apparatus 1 according to the embodiment of the present invention;
FIG. 5 is a block diagram of the structure of the system monitoring apparatus 2 according to the embodiment of the present invention;
FIG. 6 is a block diagram of the AI operation and maintenance analysis device 3 according to the embodiment of the present invention;
FIG. 7 is a block diagram of the structure of the fault drilling orchestration device 4 according to the embodiment of the present invention;
FIG. 8 is a block diagram of the structure of the fault drilling implementation apparatus 5 according to the embodiment of the present invention;
FIG. 9 is a schematic block diagram of a system configuration of an electronic apparatus 600 according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Current fault drilling tools provide only a basic fault simulation mechanism: they simulate a single system index, lack the ability to simulate an actual scenario, and simulate indexes with poor accuracy. Based on this, the embodiment of the invention provides a fault scene simulation scheme by which actual production fault scenes can be simulated accurately. Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
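Fig. 1 is a flowchart of a method for simulating a fault scenario according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 101, acquiring monitoring data of a service system, wherein the monitoring data comprises: application service information and system resource information.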
The application service information herein may include: service success rate (i.e., the probability that an application service request or task execution succeeds), throughput (i.e., the number of requests processed by the system per unit time), response time (i.e., the time it takes for an application service to process a request or a task), etc. The system resource information may include: CPU (Central Processing Unit) utilization, memory utilization, and IO (Input/Output) information, which may be IOPS (Input/Output Operations Per Second) or the like.
Step 102, identifying abnormal time periods in the acquired application service information and system resource information according to a preset identification model. The identification model is constructed and trained based on historical monitoring data, which comprises historical application service information and historical system resource information. The identification model is trained according to abnormal time periods in the historical monitoring data. The training process is covered in the description below of how the identification model identifies the abnormal time period.
Step 103, generating a fault scene according to the system resource information in the identified abnormal time period, wherein the fault scene comprises: at least one anomaly indicator derived from the system resource information in the abnormal time period.
Step 104, performing a fault scene simulation operation on the simulation service system according to the fault scene.
The fault scenario simulation operation herein may include at least one of: CPU fault scene simulation operation, memory fault scene simulation operation, and IO (Input/Output) fault scene simulation operation.
Compared with a single system index simulation scheme in the prior art, the fault scene of the embodiment of the invention integrates multiple abnormal indexes, so that the actual fault scene can be accurately simulated.
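As a rough illustration, the monitoring data acquired in step 101 could be represented as one record per sampling instant that carries both groups of information. The sketch below is only one possible layout; the class name, field names, and units are assumptions made for illustration and do not come from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringSample:
    """One monitoring data point (all field names are illustrative)."""
    timestamp: int           # seconds since the epoch
    success_rate: float      # fraction of successful requests, 0..1
    throughput: float        # requests processed per second
    response_time_ms: float  # time taken to process one request
    cpu_util: float          # CPU utilization in percent
    mem_util: float          # memory utilization in percent
    iops: float              # IO operations per second

    def application_service_info(self) -> dict:
        return {
            "success_rate": self.success_rate,
            "throughput": self.throughput,
            "response_time_ms": self.response_time_ms,
        }

    def system_resource_info(self) -> dict:
        return {"cpu_util": self.cpu_util, "mem_util": self.mem_util, "iops": self.iops}


sample = MonitoringSample(1700000000, 0.98, 1200.0, 35.0, 72.5, 61.0, 850.0)
print(sample.system_resource_info())
```

Keeping the two groups behind separate accessors mirrors the way the method uses them: both groups feed the identification of abnormal time periods, while only the system resource information is used to build the fault scene.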
In step 102, the acquired application service information and system resource information are first classified according to the preset identification model; the classified information is then analyzed according to a preset rule, and the abnormal time period is identified according to the analysis result.
In practical operation, the application service information and the system resource information are generally time series data, and the time series data can be generally divided into three categories: stationary, periodic and irregular fluctuation.
In the identification process, the identification model first classifies the time series data automatically. This can be implemented with an unsupervised clustering model or a supervised learning model; to improve classification accuracy, classifiers implemented with several different algorithms can be combined by ensemble voting. After the time series data have been classified, a service abnormal scene can be identified in stationary data according to a set expected threshold; for periodic data, abnormality detection can be performed by supervised learning; and for irregularly fluctuating data, abnormality detection can be performed by manual identification.
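The ensemble voting mentioned above can be sketched as a simple majority vote over the labels returned by several classifiers. The three stand-in classifiers below are hypothetical hand-written rules used only to make the example runnable; in practice they would be trained models, and the embodiment does not prescribe any particular voting rule.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[Sequence[float]], str]


def ensemble_vote(series: Sequence[float], classifiers: Sequence[Classifier]) -> str:
    """Return the label chosen by the most classifiers (ties go to the label seen first)."""
    labels = [classify(series) for classify in classifiers]
    return Counter(labels).most_common(1)[0][0]


# Hypothetical stand-in classifiers, each a crude hand-written rule.
def by_spread(series: Sequence[float]) -> str:
    return "stationary" if max(series) - min(series) < 5 else "irregular"


def by_repetition(series: Sequence[float]) -> str:
    half = len(series) // 2
    return "periodic" if list(series[:half]) == list(series[half:]) else "stationary"


def by_mean_shift(series: Sequence[float]) -> str:
    half = len(series) // 2
    first = sum(series[:half]) / half
    second = sum(series[half:]) / (len(series) - half)
    return "stationary" if abs(first - second) < 1 else "irregular"


cpu_series = [50, 51, 50, 52, 50, 51]
label = ensemble_vote(cpu_series, [by_spread, by_repetition, by_mean_shift])
print(label)
```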
In one embodiment, the unsupervised clustering algorithm model may be a K-means clustering model. To train the model, the acquired historical monitoring data (including historical application service information and historical system resource information) can be input into the K-means clustering model, which identifies and classifies the stationary data, periodic data, and irregularly fluctuating data in the historical monitoring data; the parameters of the K-means clustering model are then updated according to the classification result so as to train the model. When the accuracy of the classification result reaches a predetermined value (e.g., 95%), the model training may be considered complete.
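A minimal sketch of K-means clustering applied to this classification task is given below. It assumes that each time series has already been reduced to a two-dimensional feature vector (a variability score and a periodicity score); these features, the sample values, and the initial centroids are illustrative assumptions, not details taken from the embodiment.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm. Returns (centroids, cluster index per point)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical (variability, periodicity) features, one pair per time series.
features = [
    (0.02, 0.10), (0.03, 0.05),  # low variability: stationary-like
    (0.30, 0.90), (0.35, 0.95),  # strong periodicity: periodic-like
    (0.80, 0.10), (0.90, 0.20),  # high variability, no period: irregular-like
]
centroids, clusters = kmeans(features, centroids=[features[0], features[2], features[4]])
print(clusters)
```

K-means only returns cluster indices; mapping a cluster to "stationary", "periodic", or "irregular", and measuring accuracy against a predetermined value, requires some labeled series, which this sketch leaves out.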
The supervised learning algorithm model may be a K-Nearest Neighbor (KNN) algorithm model. Similarly, to train the model, the acquired historical monitoring data can be input into the K-nearest neighbor algorithm model, which identifies and classifies the stationary data, periodic data, and irregularly fluctuating data in the historical monitoring data; the parameters of the K-nearest neighbor algorithm model are then updated according to the classification result so as to train the model. When the accuracy of the classification result reaches a predetermined value (e.g., 95%), the model training may be considered complete.
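The K-nearest neighbor classification and the accuracy check against a predetermined value can be sketched as follows. As in any such sketch, the feature vectors, labels, and held-out samples are invented for illustration; only the 95% target comes from the text above.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label); returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, held_out, k=3):
    """Fraction of held-out samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in held_out)
    return hits / len(held_out)


# Hypothetical labeled history: (variability, periodicity) -> data type.
history = [
    ((0.02, 0.10), "stationary"), ((0.03, 0.05), "stationary"),
    ((0.30, 0.90), "periodic"), ((0.35, 0.95), "periodic"),
    ((0.80, 0.10), "irregular"), ((0.90, 0.20), "irregular"),
]
held_out = [
    ((0.32, 0.85), "periodic"),
    ((0.025, 0.08), "stationary"),
    ((0.85, 0.15), "irregular"),
]

predicted = knn_predict(history, (0.32, 0.85))
training_complete = accuracy(history, held_out) >= 0.95
print(predicted, training_complete)
```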
Then, the identification model continues to analyze the classification result: for stationary data, a service abnormal scene can be identified according to a set expected threshold (which can be set based on an empirical value); for periodic data, abnormality detection can be performed by supervised learning (for example, a K-nearest neighbor algorithm); and for irregularly fluctuating data, abnormality detection can be performed through manual identification.
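For stationary data, the threshold rule above amounts to scanning the series and merging consecutive breaching samples into abnormal time periods. The following sketch assumes samples are (timestamp, value) pairs sorted by time; the function name, the sample values, and the 0.95 threshold are illustrative.

```python
def abnormal_periods(samples, threshold, below=True):
    """Return (start, end) timestamp pairs covering runs of samples that breach the threshold.

    With below=True a value is abnormal when it drops under the threshold
    (e.g. service success rate); with below=False when it rises above it.
    """
    periods = []
    start = last = None
    for timestamp, value in samples:
        breached = value < threshold if below else value > threshold
        if breached:
            if start is None:
                start = timestamp  # a new abnormal run begins
            last = timestamp
        elif start is not None:
            periods.append((start, last))  # the run ended at the previous sample
            start = None
    if start is not None:
        periods.append((start, last))  # the series ended during a run
    return periods


# Service success rate sampled every 60 seconds (made-up values).
success_rate = [(0, 0.99), (60, 0.98), (120, 0.80), (180, 0.75), (240, 0.99), (300, 0.70)]
periods = abnormal_periods(success_rate, threshold=0.95)
print(periods)
```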
After the abnormal time period is identified, at least one abnormal index can be determined according to the system resource information in the identified abnormal time period; and then, performing combined operation on the at least one abnormal index to generate a fault scene. Because the fault scene is fused with a plurality of abnormal indexes, the fault scene can be accurately simulated.
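One way to picture this step is a function that looks at the system resource samples inside the abnormal time period, keeps the metrics that exceed a limit, and combines them into a single fault scene. The embodiment does not state how an index is judged abnormal, so the per-metric limits and the use of the peak value below are assumptions made for the sake of a runnable example.

```python
def build_fault_scenario(resource_samples, period, limits):
    """Combine the abnormal system resource metrics of one period into a fault scene.

    resource_samples: list of dicts, each with a "timestamp" key plus metric values.
    limits: metric name -> value above which the metric counts as abnormal (assumed rule).
    """
    start, end = period
    window = [s for s in resource_samples if start <= s["timestamp"] <= end]
    indicators = {}
    for metric, limit in limits.items():
        peak = max((s[metric] for s in window), default=None)
        if peak is not None and peak > limit:
            indicators[metric] = peak  # record the peak as the level to reproduce
    return {"period": period, "indicators": indicators}


resource_samples = [
    {"timestamp": 60, "cpu_util": 40, "mem_util": 55, "iops": 900},
    {"timestamp": 120, "cpu_util": 95, "mem_util": 60, "iops": 4000},
    {"timestamp": 180, "cpu_util": 97, "mem_util": 62, "iops": 4500},
]
scenario = build_fault_scenario(
    resource_samples,
    period=(120, 180),
    limits={"cpu_util": 90, "mem_util": 85, "iops": 3000},
)
print(scenario)
```

In this made-up case the fault scene combines a CPU index and an IO index, while memory stays below its limit and is left out.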
In a specific implementation, after the abnormal indexes are determined, indexes such as the CPU utilization rate and the IO information may be combined to generate a fault scene, and the relevant simulation instructions are then executed on the simulation service system according to the fault scene. For example, the fault scene simulation operation may be performed on the simulation service system based on the kernel process control mechanism of the Linux operating system.
Based on similar inventive concepts, the embodiment of the present invention further provides a simulation apparatus for a fault scenario, and preferably, the apparatus is configured to implement the process in the foregoing method embodiment.
Fig. 2 is a block diagram of a simulation apparatus of the fault scenario, and as shown in fig. 2, the apparatus includes: an information monitoring unit 201, an abnormality recognition unit 202, a fault scenario generation unit 203, and a fault scenario simulation unit 204, wherein:
an information monitoring unit 201, configured to obtain application service information and system resource information of the service system.
And the abnormality identification unit 202 is configured to identify an abnormal time period in the acquired application service information and system resource information according to a preset identification model. Wherein the recognition model is constructed and trained based on historical monitoring data, the historical monitoring data comprising: historical application service information and historical system resource information.
Specifically, the abnormality recognition unit includes: a classification module and an anomaly identification module, wherein: the classification module is used for classifying the acquired application service information and the acquired system resource information according to a preset identification model; and the abnormal recognition module is used for analyzing the classified information according to a preset rule and recognizing an abnormal time period according to an analysis result.
In one embodiment, the anomaly identification module comprises: a stationary data analysis submodule and a periodic data analysis submodule, wherein: the stable data analysis submodule is used for analyzing the classified information according to a preset threshold value when the classified information belongs to stable data; and the periodic data analysis submodule is used for analyzing the classified information based on a supervised learning mode when the classified information belongs to the periodic data.
A fault scenario generation unit 203, configured to generate a fault scenario according to the system resource information in the identified abnormal time period, where the fault scenario includes: at least one anomaly indicator derived from the system resource information in the abnormal time period.
Specifically, the failure scenario generation unit includes: an abnormal index determination module and a fault scenario generation module, wherein: the abnormal index determining module is used for determining an abnormal index according to the system resource information in the identified abnormal time period; and the fault scene generation module is used for performing combined operation on the abnormal indexes to generate a fault scene.
And the fault scene simulation unit 204 is configured to perform a fault scene simulation operation on the simulation service system according to the fault scene.
The fault scenario simulation operation herein includes at least one of: CPU fault scene simulation operation, memory fault scene simulation operation and IO fault scene simulation operation.
The application service information and the system resource information acquired by the information monitoring unit 201 are analyzed by the abnormality identification unit 202 according to the identification model, the abnormal time period is identified, then the fault scene generation unit 203 generates a fault scene according to the system resource information in the identified abnormal time period, and the fault scene simulation unit 204 performs fault scene simulation operation on the simulation service system according to the fault scene.
For a better understanding of the embodiments of the present invention, a specific embodiment is given below based on the Linux system.
Fig. 3 shows an AI (Artificial Intelligence) based fault drilling device system using Linux kernel process group control. As shown in fig. 3, the system includes: a service monitoring device 1, a system monitoring device 2, an AI operation and maintenance analysis device 3, a fault drilling orchestration device 4, and a fault drilling implementation device 5. The service monitoring device 1 and the system monitoring device 2 are each connected to the AI operation and maintenance analysis device 3; the AI operation and maintenance analysis device 3 is connected to the fault drilling orchestration device 4; and the fault drilling orchestration device 4 is connected to the fault drilling implementation device 5.
Preferably, the service monitoring device 1 and the system monitoring device 2 correspond to the above-described information monitoring unit 201, the AI operation and maintenance analysis device 3 corresponds to the above-described abnormality recognition unit 202, the fault drilling orchestration device 4 corresponds to the above-described fault scenario generation unit 203, and the fault drilling implementation device 5 corresponds to the above-described fault scenario simulation unit 204.
In the specific implementation process, the workflow of the fault drilling device system comprises the following steps:
Step 1): the service monitoring device monitors the service state and supports the collection of application service monitoring indexes.
Step 2): the system monitoring device monitors the state of the system and supports the acquisition of system resource monitoring indexes.
Step 3): the AI operation and maintenance analysis device analyzes according to the application service index acquired by the service monitoring device in the step 1) and the system resource monitoring index acquired by the system monitoring device in the step 2), identifies an abnormal scene, and records the system resource index in the scene to form a multi-index combined fault scene.
Step 4): the fault drilling arrangement device carries out fault drilling task arrangement aiming at the simulation service system according to the fault scene formed by analysis in the step 3).
Step 5): the fault drilling implementation device receives the drilling tasks issued in step 4) and simulates the different system indexes through the Linux kernel process group control mechanism, thereby simulating the fault scene.
To further understand the practice of the present invention, the five devices are described separately below.
(1) Service monitoring device 1
The service monitoring device 1 monitors application service indexes (such as service success rate, throughput, response time and the like), collects monitoring data and sends the monitoring data to the AI operation and maintenance analysis device 3.
Fig. 4 is a block diagram showing the configuration of the service monitoring apparatus 1. As shown in fig. 4, the service monitoring apparatus 1 includes: a service index monitoring unit 11 and a monitoring data acquisition unit 12, wherein:
the service index monitoring unit 11 is used for monitoring the target service state, including the service success rate, the throughput, the response time and the like;
the monitoring data acquisition unit 12 is used for collecting service monitoring data for subsequent analysis.
(2) System monitoring device 2
The system monitoring device 2 monitors the application system indexes (such as CPU utilization, memory utilization, IOPS, etc.), collects the monitoring data, and sends the monitoring data to the AI operation and maintenance analysis device 3.
Fig. 5 is a block diagram showing the configuration of the system monitoring apparatus 2. As shown in fig. 5, the system monitoring apparatus 2 includes: a system index monitoring unit 21 and a monitoring data acquisition unit 22, wherein:
the system index monitoring unit 21 is used for monitoring the state of the target system, including the CPU utilization rate, the memory utilization rate, the IOPS and the like;
the monitoring data acquisition unit 22 is used for collecting system monitoring data for subsequent analysis.
(3) AI operation and maintenance analysis device 3
The AI operation and maintenance analysis device 3 performs analysis using an AI model based on the monitoring data of the service monitoring device 1 and the system monitoring device 2, identifies a time period of service abnormality, records the use condition of system resources at this stage, and uses the system resources as combined index data of a fault drilling scene. The AI model corresponds to the recognition model in the above method embodiment, and for the specific model training and the abnormality recognition process, reference may be made to the description of the relevant parts, which is not described herein again.
Fig. 6 is a block diagram showing the configuration of the AI operation and maintenance analysis device 3, and as shown in fig. 6, the AI operation and maintenance analysis device 3 includes: an abnormal scene recognition unit 31, a fault scene generation unit 32, wherein:
the abnormal scene recognition unit 31 is used for identifying abnormal scenes; specifically, it identifies them through AI model learning and analysis based on the service monitoring data and the system monitoring data;
the fault scenario generation unit 32 records, for each abnormal scene identified by the abnormal scene recognition unit 31, the change trend of the system resource indexes in the corresponding time period, and forms a combination of system resource indexes for simulating the abnormal scene.
In practical operation, the monitoring data of the service monitoring apparatus 1 and the system monitoring apparatus 2 are time series data, which can generally be divided into three types: stationary, periodic, and irregularly fluctuating. Abnormal service scenes can be identified for the stationary and periodic monitoring data. First, the time series data are classified automatically, which can be implemented with unsupervised clustering or a supervised learning model; to improve classification accuracy, classifiers implemented with several different algorithms can be combined by ensemble voting. After the time series data have been classified, an expected threshold is set for stationary data to identify service abnormal scenes; for periodic data, abnormality detection can be performed by supervised learning; and for irregularly fluctuating data, abnormality detection can be performed manually.
(4) Fault drilling orchestration device 4
The fault drilling orchestration device 4 generates a corresponding fault drilling scenario (i.e., fault scenario) based on the analysis result of the AI operation and maintenance analysis device 3, and provides fault drilling task orchestration and issuing functions. The fault drilling scenario is formed mainly by combining the various abnormal index data in the abnormal time period detected by the AI operation and maintenance analysis device 3. For example, if the abnormal scene involves a CPU spike and an IO spike, the drilling scenario can be generated by combining indexes such as CPU utilization rate and IOPS.
Fig. 7 is a block diagram of the fault practicing arrangement 4, and as shown in fig. 7, the fault practicing arrangement 4 includes: fault drilling task orchestration unit 41, fault drilling task issuing unit 42, wherein:
the breakdown drilling task orchestration unit 41: based on the fault scenario generated by the fault scenario generation unit 32 in the AI operation and maintenance analysis device 3, a fault drilling task is scheduled for the simulation service system (or referred to as a target node). Specifically, according to the abnormal scene in the AI operation and maintenance analysis device 3, the abnormal indexes are extracted, combined, and arranged in time sequence to form a drilling fault scene. The generated fault scene can be used for being associated with a target drilling server to form a certain drilling task.
The breakdown exercise task issuing unit 42: and issuing the scheduled fault drilling task to the simulation service system.
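The extract, combine, and time-order steps of the orchestration unit 41 can be sketched as follows. This is only an illustration under stated assumptions: the class names `AbnormalIndicator` and `DrillTask`, their fields, and the node name `drill-server-01` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AbnormalIndicator:
    name: str        # e.g. "cpu_usage_percent" or "iops" (hypothetical names)
    start_s: int     # offset from the start of the abnormal period, in seconds
    duration_s: int  # how long the abnormal value lasted, in seconds
    value: float     # abnormal value to reproduce during the drill


@dataclass
class DrillTask:
    target_node: str
    steps: List[AbnormalIndicator] = field(default_factory=list)


def orchestrate(indicators: List[AbnormalIndicator], target_node: str) -> DrillTask:
    """Combine abnormal indicators into one task, ordered by start time."""
    ordered = sorted(indicators, key=lambda item: (item.start_s, item.name))
    return DrillTask(target_node=target_node, steps=ordered)


indicators = [
    AbnormalIndicator("iops", start_s=30, duration_s=120, value=5000.0),
    AbnormalIndicator("cpu_usage_percent", start_s=0, duration_s=180, value=95.0),
]
task = orchestrate(indicators, target_node="drill-server-01")
print([step.name for step in task.steps])  # ['cpu_usage_percent', 'iops']
```

Sorting by start offset keeps the relative timing observed in the abnormal period, so the drill replays the indexes in the order in which they became abnormal.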
(5) Fault drilling implementation device 5
The fault drilling implementation device 5 receives the command sent by the fault drilling orchestration device 4, executes the corresponding drilling instruction according to the fault drilling scenario, and accurately simulates the system resource usage of the abnormal scenario through the Linux kernel process group control mechanism.
Fig. 8 is a block diagram of the fault drilling implementation device 5. As shown in Fig. 8, the fault drilling implementation device 5 may include a CPU fault drilling unit 51, a memory fault drilling unit 52, an IO fault drilling unit 53, and the like, wherein:
CPU fault drilling unit 51: used to realize the CPU fault drilling index. In one example, the CPU fault drilling unit 51 can accurately control the CPU occupancy of the fault drilling program through the Linux kernel process group control mechanism.
Specifically, a control group named chaos is created under the /sys/fs/cgroup/cpu directory of the Linux operating system, and the system automatically generates the resource restriction files of the control group subsystem under the /sys/fs/cgroup/cpu/chaos directory. Accurate fault simulation is achieved by using system calls to specify the PID of the target process (/sys/fs/cgroup/cpu/chaos/tasks) and the CPU resource limit applied to it (/sys/fs/cgroup/cpu/chaos/cpu.cfs_quota_us).
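The control group steps above can be sketched in Python as follows. This is a minimal sketch that assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/cpu and root privileges. The helper names `quota_for_percent` and `limit_cpu` are hypothetical, and writing `cpu.cfs_period_us` explicitly is an added assumption (the source only mentions the quota file and the tasks file).

```python
from pathlib import Path

CGROUP_CPU_ROOT = Path("/sys/fs/cgroup/cpu")  # cgroup v1 cpu controller


def quota_for_percent(cpu_percent: float, period_us: int = 100_000) -> int:
    """Convert a CPU percentage of one core into a cpu.cfs_quota_us value."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return int(period_us * cpu_percent / 100)


def limit_cpu(
    pid: int,
    cpu_percent: float,
    group: str = "chaos",
    root: Path = CGROUP_CPU_ROOT,
    period_us: int = 100_000,
) -> Path:
    """Create the control group, set its CPU quota, and attach the process."""
    group_dir = Path(root) / group
    # On a real cgroup filesystem, creating the directory makes the kernel
    # populate the resource restriction files automatically.
    group_dir.mkdir(parents=True, exist_ok=True)
    (group_dir / "cpu.cfs_period_us").write_text(f"{period_us}\n")
    quota_us = quota_for_percent(cpu_percent, period_us)
    (group_dir / "cpu.cfs_quota_us").write_text(f"{quota_us}\n")
    # Writing the PID to the tasks file moves the process into the group.
    (group_dir / "tasks").write_text(f"{pid}\n")
    return group_dir
```

For example, `limit_cpu(pid, 30)` would cap a CPU-burning drill process at roughly 30% of one core, so the load it generates stays close to the target index. The `root` parameter exists only so that the logic can be exercised against an ordinary directory.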
Memory fault drilling unit 52: used to realize the memory fault drilling index. In one example, the memory fault drilling unit 52 uses the mount command provided by the Linux operating system to mount a tmpfs (a memory-based file system) of a specified size, simulating a fault by occupying a certain proportion of the system memory.
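A sketch of how the tmpfs mount command might be derived from a target memory proportion is shown below. The function names, the mount point /mnt/chaos-mem, and the 16384 MB total are hypothetical. The sketch only builds the command and does not run it. Note also that tmpfs consumes memory only as data is written into it, so an actual drill would presumably also need to fill the mounted file system; that step is an assumption and is not shown.

```python
from typing import List


def tmpfs_size_mb(total_mem_mb: int, ratio: float) -> int:
    """Size of the tmpfs, in MB, needed to occupy the given share of memory."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1 (exclusive)")
    return int(total_mem_mb * ratio)


def tmpfs_mount_command(mount_point: str, size_mb: int) -> List[str]:
    """Build the argument list for mounting a tmpfs of the specified size."""
    return ["mount", "-t", "tmpfs", "-o", f"size={size_mb}m", "tmpfs", mount_point]


size = tmpfs_size_mb(total_mem_mb=16384, ratio=0.25)
cmd = tmpfs_mount_command("/mnt/chaos-mem", size)
print(" ".join(cmd))  # mount -t tmpfs -o size=4096m tmpfs /mnt/chaos-mem
```

The argument list could be passed to `subprocess.run(cmd, check=True)` by a process with root privileges once the mount point exists.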
IO fault drilling unit 53: used to realize the IO fault drilling index. In one example, the IO fault drilling unit 53 controls the IOPS occupancy of the fault drilling program through the Linux kernel process group control mechanism.
The number of bytes read and written per second is set through the Linux system path /sys/fs/cgroup/blkio/throw.
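The path given in the text above is truncated, so the following sketch is an assumption rather than the method of the source: it uses the standard cgroup v1 blkio throttling files `blkio.throttle.read_bps_device` and `blkio.throttle.write_bps_device`, which take lines of the form `major:minor bytes_per_second`. The helper names and the control group name chaos (borrowed from the CPU example) are hypothetical.

```python
from pathlib import Path

CGROUP_BLKIO_ROOT = Path("/sys/fs/cgroup/blkio")  # cgroup v1 blkio controller


def throttle_line(major: int, minor: int, limit: int) -> str:
    """Format a throttle rule for one block device, e.g. '8:0 1048576'."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    return f"{major}:{minor} {limit}"


def limit_io(
    pid: int,
    major: int,
    minor: int,
    read_bps: int,
    write_bps: int,
    group: str = "chaos",
    root: Path = CGROUP_BLKIO_ROOT,
) -> Path:
    """Cap read/write bytes per second on one device for the given process."""
    group_dir = Path(root) / group
    group_dir.mkdir(parents=True, exist_ok=True)
    read_rule = throttle_line(major, minor, read_bps)
    write_rule = throttle_line(major, minor, write_bps)
    (group_dir / "blkio.throttle.read_bps_device").write_text(read_rule + "\n")
    (group_dir / "blkio.throttle.write_bps_device").write_text(write_rule + "\n")
    # Attach the drill process so the limits apply to its IO.
    (group_dir / "tasks").write_text(f"{pid}\n")
    return group_dir
```

An IOPS limit would be set the same way through the `blkio.throttle.read_iops_device` and `blkio.throttle.write_iops_device` files of the same controller.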
The fault drilling scheme based on AI analysis and kernel process group control provided by the embodiment of the invention uses an AI analysis model to support accurate simulation of the combination of system indexes observed in an abnormal scenario, and improves the precision of fault simulation through the Linux kernel process group control mechanism. The scheme supports a combined-index fault drilling model based on real scenarios, avoids the simulation distortion that single-dimensional fault simulation may cause, and improves the accuracy of drilled system faults involving indexes such as CPU (central processing unit) usage and IOPS (input/output operations per second).
In practical operation, the units, the modules and the sub-modules may be combined or may be arranged singly, and the present invention is not limited thereto.
The present embodiment also provides an electronic device, which may be a desktop computer, a tablet computer, a mobile terminal, and the like, but is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the above method embodiment and the embodiment of the fault scene simulation apparatus, and the contents thereof are incorporated herein, and repeated descriptions are omitted.
Fig. 9 is a schematic block diagram of a system configuration of an electronic apparatus 600 according to an embodiment of the present invention. As shown in fig. 9, the electronic device 600 may include a central processor 100 and a memory 140; the memory 140 is coupled to the central processor 100. Notably, this diagram is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the fault scenario simulation functionality may be integrated into the central processor 100. The central processor 100 may be configured to control as follows:
acquiring monitoring data of a service system, wherein the monitoring data comprises: application service information and system resource information;
identifying abnormal time periods in the obtained application service information and system resource information according to a preset identification model, wherein the identification model is constructed based on historical monitoring data, and the historical monitoring data comprises: historical application service information and historical system resource information;
generating a fault scene according to the system resource information in the identified abnormal time period, wherein the fault scene comprises: at least one anomaly indicator derived from system resource information in the anomaly time period;
and performing fault scene simulation operation on the simulation service system according to the fault scene.
As can be seen from the above description, according to the electronic device provided in the embodiment of the present invention, the abnormal time period is identified by analyzing the acquired application service information and system resource information according to the identification model, then the fault scene is generated according to the system resource information in the identified abnormal time period, and the fault scene simulation operation is performed on the simulation service system according to the fault scene.
In another embodiment, the fault scene simulation apparatus may be configured separately from the central processor 100. For example, the fault scene simulation apparatus may be configured as a chip connected to the central processor 100, and the fault scene simulation function is realized under the control of the central processor 100.
As shown in fig. 9, the electronic device 600 may further include: communication module 110, input unit 120, audio processing unit 130, display 160, power supply 170. It is noted that the electronic device 600 does not necessarily include all of the components shown in FIG. 9; furthermore, the electronic device 600 may also comprise components not shown in fig. 9, which may be referred to in the prior art.
As shown in fig. 9, the central processor 100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processor 100 receiving input and controlling the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store the fault-related information described above as well as a program for processing that information. The central processor 100 may execute the program stored in the memory 140 to implement information storage, processing, and the like.
The input unit 120 provides input to the central processor 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used to display an object to be displayed, such as an image or a character. The display may be, for example, an LCD display, but is not limited thereto.
The memory 140 may be a solid state memory such as a read only memory (ROM), a random access memory (RAM), a SIM card, or the like. It may also be a memory that retains information even when powered off, can be selectively erased, and can be written with additional data; an example of such a memory is sometimes called an EPROM. The memory 140 may also be some other type of device. The memory 140 includes a buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage section 142, which stores application programs and function programs, or the flow by which the central processor 100 executes the operation of the electronic device 600.
The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage portion 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging application, address book application, etc.).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a Bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132, thereby implementing general telecommunications functions. The audio processor 130 may include any suitable buffers, decoders, amplifiers, and so forth. In addition, the audio processor 130 is also coupled to the central processor 100, so that sound can be recorded locally through the microphone 132 and locally stored sound can be played through the speaker 131.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the fault scene simulation method.
In summary, to address the problems that a single index cannot adequately simulate a real fault scenario during fault drilling and that the precision of fault drilling indexes is low, embodiments of the present invention provide a fault drilling scheme. The scheme applies AI analysis to system resource operation and maintenance monitoring data from the actual production environment to obtain the combination of system resource indexes corresponding to a fault scenario, and then accurately simulates the fault scenario from that index combination through the Linux kernel process group control mechanism.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings. The many features and advantages of the embodiments are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the embodiments which fall within the true spirit and scope thereof. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the embodiments of the invention to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope thereof.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and implementation of the invention have been explained above through specific embodiments, and the description of the embodiments is intended only to help in understanding the method and core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (12)
1. A method for simulating a fault scenario, the method comprising:
acquiring monitoring data of a service system, wherein the monitoring data comprises: application service information and system resource information;
identifying abnormal time periods in the obtained application service information and system resource information according to a preset identification model, wherein the identification model is constructed based on historical monitoring data, and the historical monitoring data comprises: historical application service information and historical system resource information;
generating a fault scene according to the system resource information in the identified abnormal time period, wherein the fault scene comprises: at least one anomaly indicator derived from system resource information in the anomaly time period;
and performing fault scene simulation operation on the simulation service system according to the fault scene.
2. The method of claim 1, wherein identifying abnormal time periods in the obtained application service information and system resource information according to a preset identification model comprises:
classifying the acquired application service information and system resource information according to a preset identification model;
and analyzing the classified information according to a preset rule, and identifying an abnormal time period according to an analysis result.
3. The method of claim 2, wherein analyzing the classified information according to a predetermined rule comprises:
when the classified information belongs to stable data, analyzing the classified information according to a preset threshold value;
and when the classified information belongs to periodic data, analyzing the classified information based on a supervised learning mode.
4. The method of claim 1, wherein generating a fault scenario from system resource information in the identified anomalous time period comprises:
determining at least one abnormal index according to the system resource information in the identified abnormal time period;
and performing combined operation on the at least one abnormal index to generate a fault scene.
5. The method of claim 1, wherein the fault scenario simulation operation comprises at least one of:
a central processing unit (CPU) fault scene simulation operation, a memory fault scene simulation operation, and an input/output (IO) fault scene simulation operation.
6. The method of claim 1, wherein the historical monitoring data and the obtained monitoring data are from different versions of a service system.
7. The method of claim 1, wherein the historical monitoring data is from a same version of a service system as the obtained monitoring data.
8. The method of claim 1, further comprising:
and updating the identification model according to the identified abnormal time period.
9. The method according to claim 1, wherein the service system is based on a Linux system, and performing fault scenario simulation operation on a simulation service system according to the fault scenario comprises:
and performing fault scene simulation operation on the simulation service system according to the fault scene based on a kernel process control mechanism of the Linux system.
10. An apparatus for simulating a fault scenario, the apparatus comprising:
an information monitoring unit, configured to obtain monitoring data of a service system, where the monitoring data includes: application service information and system resource information;
an exception identifying unit, configured to identify an exception time period in the acquired application service information and system resource information according to a preset identification model, where the identification model is constructed based on historical monitoring data, and the historical monitoring data includes: historical application service information and historical system resource information;
a fault scenario generation unit, configured to generate a fault scenario according to the system resource information in the identified abnormal time period, where the fault scenario includes: at least one anomaly indicator derived from system resource information in the anomaly time period;
and the fault scene simulation unit is used for carrying out fault scene simulation operation on the simulation service system according to the fault scene.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 9 are implemented when the processor executes the program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011538845.8A CN112631850B (en) | 2020-12-23 | 2020-12-23 | Fault scene simulation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011538845.8A CN112631850B (en) | 2020-12-23 | 2020-12-23 | Fault scene simulation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112631850A (en) | 2021-04-09 |
CN112631850B (en) | 2024-07-02 |
Family
ID=75321838
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202011538845.8A | Active | CN112631850B (en) | 2020-12-23 | 2020-12-23 | Fault scene simulation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112631850B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114721825A (en) * | 2022-04-07 | 2022-07-08 | 苏州浪潮智能科技有限公司 | Simulation method and system for sub-health fault, computer equipment and medium |
CN116069638A (en) * | 2023-01-19 | 2023-05-05 | 蔷薇大树科技有限公司 | Method for simulating distributed abnormal state based on kernel mode |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763039A (en) * | 2018-04-02 | 2018-11-06 | 阿里巴巴集团控股有限公司 | A kind of traffic failure analogy method, device and equipment |
CN109559583A (en) * | 2017-09-27 | 2019-04-02 | 华为技术有限公司 | Failure simulation method and its device |
CN109933479A (en) * | 2017-12-19 | 2019-06-25 | 杭州华为数字技术有限公司 | Fault simulation and emulation mode and relevant device |
CN111930548A (en) * | 2020-08-12 | 2020-11-13 | 湖南快乐阳光互动娱乐传媒有限公司 | Fault simulation system of multi-cluster distributed service |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114721825A (en) * | 2022-04-07 | 2022-07-08 | 苏州浪潮智能科技有限公司 | Simulation method and system for sub-health fault, computer equipment and medium |
CN116069638A (en) * | 2023-01-19 | 2023-05-05 | 蔷薇大树科技有限公司 | Method for simulating distributed abnormal state based on kernel mode |
CN116069638B (en) * | 2023-01-19 | 2023-09-01 | 蔷薇大树科技有限公司 | Method for simulating distributed abnormal state based on kernel mode |
Also Published As
Publication number | Publication date |
---|---|
CN112631850B (en) | 2024-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |