CN117851199A

CN117851199A - Disaster recovery orchestration based on intelligent AI-driven fault identification and switching

Info

Publication number: CN117851199A
Application number: CN202311706787.9A
Authority: CN
Inventors: 华成裕; 张亮
Original assignee: Tianyi Cloud Technology Co Ltd
Current assignee: Tianyi Cloud Technology Co Ltd
Priority date: 2023-12-13
Filing date: 2023-12-13
Publication date: 2024-04-09

Abstract

The invention relates to disaster recovery orchestration based on intelligent AI-driven fault identification and switching, and belongs to the technical field of cloud computing. It comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module.

Description

Disaster recovery orchestration based on intelligent AI-driven fault identification and switching

Technical Field

The invention belongs to the technical field of cloud computing, and particularly relates to disaster recovery orchestration based on intelligent AI-driven fault identification and switching.

Background

In a typical public cloud scenario (AWS, Microsoft, Google, Alibaba Cloud, etc.), faults are detected by monitoring the state of resources (for example, layer-4 or layer-7 probes, or whether an address is reachable) and of the network. When the state of a resource becomes abnormal, an alarm is raised and automatic or manual switching is performed.

Conventional fault identification, monitoring and switching services monitor the network, access state, logs and the like of resources, and handle a fault only after an error occurs. This approach can leave the customer's application unavailable for a period of time, because automatic or manual operations can be performed only once the fault has occurred. The overall switching rules and orchestration are also simple, and struggle to cope with increasingly complex service architectures. In addition, manual switching depends on the accumulated skill and the judgment accuracy of the operations and maintenance personnel, and it takes a long time, which seriously affects service continuity.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide disaster recovery orchestration based on intelligent AI-driven fault identification and switching. By introducing AI learning and prediction capabilities and combining additional external, hardware and software data, it implements fault prediction, fault identification and a switching fault-tolerance mechanism in an integrated way. For public cloud disaster recovery platforms, AI-driven intelligent fault monitoring and analysis enables efficient fault identification and automatic switching, which improves the reliability and availability of cloud services, reduces RTO and improves application availability.

The invention provides disaster recovery orchestration based on intelligent AI-driven fault identification and switching, which comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module;

the offline data collection module is used for collecting historical fault data and recovery operation data, which are used for offline training and testing of the AI model;

the feature engineering module is used for identifying and creating features that help the AI model predict faults;

the model training and verification module is used for training one or more AI models on the features and data to predict faults, and for verifying the AI models;

the fault prediction module is used for deploying the trained model into the production environment, monitoring the state of the system in real time or periodically, and predicting faults;

the fault identification module is used for identifying faults by combining the AI model's fault prediction with manually configured rules;

the automatic switching and recovery module is used for performing automatic fault switching and recovery according to the fault identification strategy;

the learning optimization module is used for collecting and analyzing data from the fault switching and recovery process.

Further, the historical fault data and recovery operation data comprise logs of public cloud resources, system metrics, user behavior, network activity and events, and further comprise external and hardware data at the corresponding local points in time.

Further, the external and hardware data comprise weather, natural disasters, construction work, optical fibers, the water supply, electricity supply and temperature of the machine room, the temperature of physical machines, and fan rotational speed.

Further, the features used to predict faults comprise a sudden increase in CPU utilization, memory leaks, an increase in network delay, natural disasters in the area where the machine room is located, optical fiber damage, a rise in machine room temperature, and a rise in physical machine temperature.

Further, the AI model includes a supervised learning model and an unsupervised learning model.

Further, the supervised learning model includes a classifier and the unsupervised learning model includes anomaly detection.

Further, identifying and creating features that help the AI model predict faults specifically includes the following steps:

S1: cleaning and preprocessing the collected offline data;

S2: extracting and transforming features from the data according to domain knowledge and experience of the problem;

S3: performing feature selection, that is, selecting the features that correlate well with the prediction target.

Further, the cleaning and preprocessing includes handling missing values and outliers.

Further, in S3, feature selection is performed using correlation coefficients, analysis of variance, and feature importance from model training.

Further, the model training and validation module trains the AI model using a decision tree model.

The invention has the following beneficial effects:

(1) By combining offline training with real-time learning, the learning capability of the AI model improves the success rate of predicting software service faults and reduces fault duration.

(2) The invention can account for the resources and manual operations of different users. Using the result of each fault identification, each manual operation and the related monitoring data, the model performs personalized fault learning and prediction for different users' applications and different scenarios, which reduces the misjudgment rate of a general-purpose model.

(3) Through tiered threshold configuration, different automatic or manual actions are taken depending on the model's prediction result, which keeps the overall system running stably.

(4) The invention incorporates hardware and environmental data, which greatly improves the accuracy and foresight of fault identification. For example, when a fire, earthquake or flood occurs in an area, the model can raise the predicted probability of a near-term fault even if the application service is currently available, so that the fault is discovered at the earliest moment.

(5) Fault switching and recovery combine automatic and manual modes. On the one hand, switching and recovery can be carried out quickly, reducing the time spent waiting for human intervention; on the other hand, manual operation remains possible where needed.

Drawings

The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views. It is apparent that the drawings in the following description are only some of the embodiments described in the embodiments of the present invention, and that other drawings may be obtained from these drawings by those of ordinary skill in the art.

Fig. 1 is a schematic diagram of an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the embodiments of the present invention better understood by those skilled in the art, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.

In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.

In the description of the present invention, it should be noted that unless explicitly stated and limited otherwise, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The terms "mounted," "connected," "coupled," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of methods and systems that are consistent with aspects of the invention as detailed in the accompanying claims.

The invention provides disaster recovery orchestration based on intelligent AI-driven fault identification and switching. It addresses the following problems of conventional fault identification, monitoring and switching services: they monitor the network, access state, logs and the like of resources and handle a fault only after an error occurs, so the customer's application may be unavailable for a period of time; automatic or manual operations can be performed only once the fault has occurred; and the overall switching rules and orchestration are simple and struggle to cope with increasingly complex service architectures.

To aid understanding of the invention, the terms used herein are explained:

MDRS: Multi-active Disaster Recovery Service;

AI: Artificial Intelligence;

RTO: Recovery Time Objective, the time required for a service to return to an acceptable service level after a system failure occurs.

To explain the present invention, as shown in Fig. 1, the following embodiments are proposed:

Embodiment 1

The disaster recovery orchestration based on intelligent AI-driven fault identification and switching comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module.

The historical fault data and recovery operation data comprise logs of public cloud resources, system metrics, user behavior, network activity and events, and also comprise external and hardware data at the corresponding local points in time.

The external and hardware data comprise weather, natural disasters, construction work, optical fibers, the water supply, electricity supply and temperature of the machine room, the temperature of physical machines, and fan rotational speed.

Embodiment 2

The features used to predict faults comprise a sudden increase in CPU utilization, memory leaks, an increase in network delay, natural disasters in the area where the machine room is located, optical fiber damage, a rise in machine room temperature, and a rise in physical machine temperature.

The AI model includes a supervised learning model and an unsupervised learning model.

The supervised learning model includes a classifier and the unsupervised learning model includes anomaly detection.

Embodiment 3

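Identifying and creating features that help the AI model predict faults specifically includes the following steps:

S1: cleaning and preprocessing the collected offline data;

S2: extracting and transforming features from the data according to domain knowledge and experience of the problem;

S3: performing feature selection, that is, selecting the features that correlate well with the prediction target.

The cleaning and preprocessing includes handling missing values and outliers.

In step S3, feature selection is performed using correlation coefficients, analysis of variance, and feature importance from model training.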

The model training and validation module trains the AI model using a decision tree model.

In summary, to solve the above problems, the invention optimizes and upgrades disaster recovery in three aspects.

In a first aspect, faults are identified by combining the AI prediction with manually configured rules, and automatic fault switching and recovery is performed according to the fault identification strategy. Compared with existing disaster recovery systems, the invention combines data, intelligence, manual operation, switching/recovery, and continuous learning capabilities.

In a second aspect, intelligent and generalized fault identification: AI is trained on fault samples, with personalized sample collection and training supported for different types of resources (databases, application software services). After training, the model predicts faults so that a fault can be identified before or at the moment it occurs and automatic fault switching can be performed. This first removes the strong dependence on personnel skills, and second improves switching efficiency and reduces fault duration.

In a third aspect, extended monitoring dimensions:

(1) Hardware monitoring: sensors monitor the hardware environment of the machine room at each site, including temperature, humidity, water and electricity.

(2) Software monitoring: the software environment deployed by the customer is monitored, including the operating system, application processes, ports, overall cluster state, network, and the state of various middleware.

(3) Cloud resource monitoring: the various cloud resources provided by public cloud vendors are brought under management and monitored.

(4) Self-owned resource monitoring: the user's own resources are probed and monitored in agent mode.

(5) Environmental monitoring: external factors such as natural disasters, weather and construction work are monitored for the customer's machine rooms, whether deployed in the same city or in different locations.

(6) Multi-dimensional monitoring data acquisition ensures that monitoring is real-time, accurate and multi-dimensional; combined with AI model training, it allows faults to be predicted and identified earlier and more accurately.

To further explain the present invention, another embodiment is provided:

Embodiment 4

The invention comprises the following functional modules:

Offline data collection: historical fault data and recovery operation data are collected. This includes logs of public cloud resources, system metrics, user behavior, network activity, events, and the like. It also includes external and hardware data at the corresponding local points in time (weather, natural disasters, construction work, optical fibers, the water supply, electricity supply and temperature of the machine room, the temperature of physical machines, fan rotational speed). All of this data can be used to train and test AI models offline.

Feature engineering: at this stage, features that can help the AI model predict faults are identified and created. For example, a sudden increase in CPU utilization, a memory leak, an increase in network delay, a natural disaster in the area of the machine room, optical fiber damage, a rise in machine room temperature, or a rise in physical machine temperature may each signal an impending fault.

Model training and verification: using the collected data and features, one or more AI models are trained to predict faults. These may include supervised learning models (e.g., classifiers) and unsupervised learning models (e.g., anomaly detection). After training, the model's performance is tested on a validation dataset.

Fault prediction: the trained models are deployed to the production environment to monitor system state, in real time or periodically, and predict possible faults.

Fault identification: faults are identified by combining the AI prediction with manually configured rules. If a manually configured rule is hit, the event is treated as a fault with 100% certainty, and fault switching and recovery are performed.

For the AI prediction result, a tiered score mechanism is applied; for ease of operation, the system defines three score tiers.

If the score is greater than 90 (a score above 90 means the model prediction is considered highly reliable, so subsequent operations can run automatically), full automatic fault switching and recovery is performed and an alarm is raised.

If the score is greater than 80 (between 80 and 90, the model is considered relatively reliable, so full automatic operation is not performed and half of the operation is performed first), 50% of the traffic is automatically switched and recovered, and an alarm is raised.

If the score is greater than 70 (between 70 and 80, the score serves as a reference value to alert the user, who decides the subsequent operation manually), an alarm is raised and the system waits for manual fault switching and recovery. The thresholds may differ between scenarios and between resources.
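The tiered decision described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the names `Action`, `Decision` and `decide` are hypothetical, the default thresholds of 90, 80 and 70 come from the text, and raising an alarm on a manual rule hit is an assumption, since the text does not say whether one is raised in that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    FULL_SWITCH = "full_switch"        # switch and recover all traffic
    PARTIAL_SWITCH = "partial_switch"  # switch and recover half of the traffic
    ALERT_ONLY = "alert_only"          # alarm only, wait for manual operation
    NONE = "none"                      # below every tier: nothing to do


@dataclass(frozen=True)
class Decision:
    action: Action
    traffic_percent: int  # share of traffic switched automatically
    alarm: bool


def decide(score: float, rule_hit: bool = False,
           tiers: Tuple[float, float, float] = (90.0, 80.0, 70.0)) -> Decision:
    """Map an AI score (0-100) and a manual-rule result to an action.

    `tiers` is configurable because the text allows different thresholds
    for different scenarios or resources.
    """
    full, partial, alert = tiers
    # A manual rule hit is treated as a certain fault (alarm is an assumption).
    if rule_hit or score > full:
        return Decision(Action.FULL_SWITCH, 100, True)
    if score > partial:
        return Decision(Action.PARTIAL_SWITCH, 50, True)
    if score > alert:
        return Decision(Action.ALERT_ONLY, 0, True)
    return Decision(Action.NONE, 0, False)
```

Keeping the thresholds in a single tuple makes it straightforward to store one tuple per resource type or scenario.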

Automated switching and recovery: automated fault switching and recovery is performed according to the fault identification strategy above. This includes redirecting traffic to a backup region, restarting failed services, or restoring system state from backup data.

Continuous learning and optimization: the system should collect and analyze data from the fault switching and recovery process in order to continually improve the AI model and the switching strategy.

If a fault was not successfully predicted, or a switching strategy did not work as expected, a person reports the event as a fault. The system then automatically records the condition data (external data, system data, software data, hardware data, etc.) from around that point in time.

The related data is fed back to the personnel who label data for the algorithm model, who confirm the data and the result. If the event is confirmed to be a fault, the data is used to retrain the model as part of a model upgrade. If it is a false report, the data is discarded so that model accuracy is not affected.
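The feedback loop above can be sketched as a small review queue. The class and field names (`FeedbackCase`, `RetrainingQueue`, `review`) are hypothetical and only illustrate the keep-or-discard step; how the snapshot is captured and how retraining is triggered are outside this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FeedbackCase:
    timestamp: float
    snapshot: Dict[str, float]              # condition data recorded around the event
    confirmed_fault: Optional[bool] = None  # None until a labeler has reviewed it


@dataclass
class RetrainingQueue:
    samples: List[FeedbackCase] = field(default_factory=list)

    def review(self, case: FeedbackCase, is_real_fault: bool) -> bool:
        """Record the labeler's verdict; keep confirmed faults, drop false reports."""
        case.confirmed_fault = is_real_fault
        if is_real_fault:
            self.samples.append(case)  # becomes data for the next model upgrade
            return True
        return False                   # discarded so it cannot skew the model
```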

In addition, regarding the specific steps of AI model training, the following embodiment is provided:

Embodiment 5

First, the fault prediction model is trained on offline data that has been collected and cleaned.

Meanwhile, the model is trained and optimized in real time using the process data and result data generated by fault identification in the production environment, together with the manual operation data of end users, to maintain the model's prediction quality.

Feature engineering follows this flow:

First, the collected offline data is cleaned and preprocessed, including handling missing values, outliers, etc.

Next, features are extracted and transformed from the data based on domain knowledge and experience of the problem. Statistical features, time-series features, frequency-domain features and the like may be used to describe the time and frequency characteristics of the data. Features such as sliding-window and lag statistics can be built from the historical information of the data to capture its trend and volatility.

Finally, the features that correlate well with the prediction target are selected, using methods such as correlation coefficients, analysis of variance, and feature importance from model training.
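As one illustration of the selection step, the sketch below ranks features by the absolute Pearson correlation with the fault label and keeps those above a cutoff. The text names correlation coefficients only as one of several methods; the cutoff of 0.5, the function names and the sample feature names are assumptions made for the example.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(features: Dict[str, Sequence[float]],
                    target: Sequence[float],
                    min_abs_corr: float = 0.5) -> List[str]:
    """Return feature names ordered by |correlation| with the target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return [name for name in ranked if abs(scores[name]) >= min_abs_corr]


# Illustrative data: label 1 marks samples taken shortly before a fault.
example = {
    "cpu_percent": [10, 20, 30, 90, 95],
    "room_temp_c": [20, 21, 22, 30, 31],
    "constant_flag": [5, 5, 5, 5, 5],
}
selected = select_features(example, [0, 0, 0, 1, 1])
```

Correlation captures only linear relationships, which is one reason to combine it with analysis of variance and model-based feature importance, as the flow above suggests.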

The algorithm model uses a decision tree algorithm. The specific algorithm and implementation process are as follows:

Data collection and preparation:

Relevant offline data is collected, including the various features and related information at the time of each fault.

The data is cleaned and preprocessed, handling missing values, outliers, etc.
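A minimal cleaning sketch for a single metric series follows. The text does not specify how missing values and outliers are handled, so the choices here (filling gaps with the median and clipping to 1.5 times the interquartile range) are assumptions, and `clean_series` is a hypothetical name.

```python
from statistics import median, quantiles
from typing import List, Optional


def clean_series(values: List[Optional[float]], k: float = 1.5) -> List[float]:
    """Fill missing readings with the median, then clip outliers to the IQR fence.

    Returns an empty list when every reading is missing.
    """
    present = [v for v in values if v is not None]
    if not present:
        return []
    fill = median(present)
    filled = [fill if v is None else v for v in values]
    if len(present) < 2:
        return filled  # quartiles need at least two readings
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in filled]


# Illustrative temperature readings with one gap and one sensor spike.
readings = [10, 11, 12, 11, 10, 12, 11, None, 10, 12, 11, 500]
cleaned = clean_series(readings)
```

Clipping rather than deleting keeps the series length unchanged, which matters when the windowed features are computed afterwards. Whether a spike is noise or an early fault signal is a domain decision, so the fence width `k` should be chosen per metric.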

Feature engineering:

Features are extracted and transformed from the data according to the background and requirements of the problem.

Statistical features, time-series features, frequency-domain features and the like may be used to describe the time and frequency characteristics of the data.

Features such as sliding-window and lag statistics are built from the historical information of the data to capture its trend and volatility.
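The sliding-window and lag features mentioned above might look like the following. The function names and the sample CPU series are illustrative; positions without enough history are marked `None` rather than guessed.

```python
from typing import List, Optional, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Mean of the last `window` readings; None until enough history exists."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


def lag(values: Sequence[float], k: int) -> List[Optional[float]]:
    """The reading from `k` steps earlier; None for the first `k` positions."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


def delta(values: Sequence[float], k: int = 1) -> List[Optional[float]]:
    """Change relative to `k` steps earlier, a simple trend feature."""
    return [None if prev is None else cur - prev
            for cur, prev in zip(values, lag(values, k))]


cpu = [20.0, 22.0, 21.0, 60.0, 95.0]
cpu_trend = delta(cpu)               # large positive values flag a sudden increase
cpu_smooth = rolling_mean(cpu, 3)
```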

Data set partitioning:

The processed data set is divided into a training set and a test set, typically using 70% of the data as the training set and 30% as the test set.
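A 70/30 split can be sketched with the standard library alone. The shuffle and the fixed seed are assumptions added for reproducibility; the text gives only the proportions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(rows: Sequence[T], train_fraction: float = 0.7,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = split_dataset(range(10))
```

For time-ordered monitoring data, a chronological split (train on earlier samples, test on later ones) may be preferable, because lag and window features can leak information across a random split.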

Training a decision tree model:

The C4.5 decision tree algorithm is selected for model training.

During training, the decision tree automatically selects the optimal feature for each node split so as to maximize the prediction accuracy of the model.
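C4.5 chooses the split feature by gain ratio, that is, information gain divided by split information. The sketch below computes that criterion for categorical features only; it is not a full C4.5 implementation (no continuous-threshold search, no pruning), and the feature names and sample values are made up for illustration.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature_values: Sequence[Hashable],
               labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on the feature, normalized by split info."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def best_feature(table: Dict[str, Sequence[Hashable]],
                 labels: Sequence[Hashable]) -> str:
    """Name of the feature with the highest gain ratio."""
    return max(table, key=lambda name: gain_ratio(table[name], labels))


labels = ["fault", "fault", "ok", "ok"]
table = {
    "room_temp": ["high", "high", "normal", "normal"],
    "fan_speed": ["fast", "slow", "fast", "slow"],
}
root = best_feature(table, labels)
```

Note that widely used libraries may implement a different tree algorithm (scikit-learn's decision trees, for example, are CART-based), so reproducing C4.5 exactly may require a dedicated implementation.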

Model evaluation and tuning:

The trained model is evaluated on the test set, and metrics such as accuracy, recall and F1 score are calculated.

Based on the evaluation results, the model's parameters, feature selection and the like can be adjusted to further optimize its performance.
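The evaluation metrics named above can be computed directly from the confusion counts. In this sketch the label 1 stands for "fault" (an assumption), and precision is included because the F1 score is defined from precision and recall.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for a binary fault classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

Faults are usually rare, so accuracy alone can look high for a model that never predicts a fault; recall and F1 are the more informative numbers for this use case.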

In addition, regarding AI fault identification and switching, the following embodiment is provided:

Embodiment 6

Real-time resource data monitoring: the state of the resources in the user's resource pool is collected through real-time heartbeats, including server state, network state, application port state, database state and the like.

Real-time external/hardware data monitoring: the system integrates with external interfaces and sensors and collects, in real time, data for the region or machine room corresponding to the user, including weather, natural disaster level, the temperature, humidity and electrical load of the machine room, server temperature and the like.
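Heartbeat-based state collection can be illustrated by classifying each resource from the time of its last heartbeat. The 15-second timeout, the function name and the resource names are assumptions; the text states only that state is collected through real-time heartbeats.

```python
from typing import Dict


def classify_heartbeats(last_seen: Dict[str, float], now: float,
                        timeout_s: float = 15.0) -> Dict[str, str]:
    """Mark a resource unreachable when its last heartbeat is older than the timeout."""
    return {
        name: "unreachable" if now - seen > timeout_s else "up"
        for name, seen in last_seen.items()
    }


# Timestamps are seconds on any monotonically increasing clock.
status = classify_heartbeats({"db-primary": 100.0, "web-1": 80.0}, now=105.0)
```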

In addition, regarding the detailed steps of fault identification, the following embodiment is provided:

Embodiment 7

Fault identification: after real-time data acquisition is complete, rule evaluation and AI model prediction are each performed to output a fault identification probability.

The detailed flow is as follows:

(1) If a manually configured rule is hit, the event is treated as a fault with 100% certainty, and fault switching and recovery are performed. Manual rule configuration supports every collected metric, so the user can configure rules on, for example, machine room temperature, whether a geological disaster has occurred, network bandwidth usage, CPU, memory, load and the like.

(2) For the AI prediction result, a tiered score mechanism is applied; for ease of operation, the system defines three score tiers.
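The manual rule check in item (1) can be sketched as a list of threshold rules evaluated against the latest sample of collected metrics. The `Rule` shape, the operator set and the metric names are hypothetical; the text says only that rules can be configured on any collected metric.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

OPS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Rule:
    metric: str       # name of a collected metric
    op: str           # one of the keys in OPS
    threshold: float


def first_hit(rules: Iterable[Rule], sample: Dict[str, float]) -> Optional[Rule]:
    """Return the first rule the sample satisfies, or None when no rule is hit."""
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and OPS[rule.op](value, rule.threshold):
            return rule
    return None


rules = [Rule("room_temp_c", ">", 40.0), Rule("cpu_percent", ">=", 98.0)]
hit = first_hit(rules, {"room_temp_c": 25.0, "cpu_percent": 99.0})
```

A metric that is missing from the sample is skipped rather than treated as a hit, so a sensor outage alone does not trigger a switchover in this sketch.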

In addition, regarding fault switching, the following embodiment is provided:

Embodiment 8

(1) If a rule is hit, or the AI prediction result reaches the specified threshold, automatic fault switching is performed immediately.

(2) If the AI score is below that threshold but a risk may still exist, an alarm is raised so that fault switching and recovery can be started manually.

(3) The system supports switching and recovery in various scenarios, specifically:

(a) Switching: sending the corresponding automatic switching instructions, for example for load balancing and traffic;

(b) Recovery: sending instructions for data recovery, both within the same city and across different locations.

The actual switching and data recovery are carried out by the corresponding middleware (load balancing, MySQL database and the like); the system only issues the instructions and triggers their execution.
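The division of labor described above, where the system only issues instructions and the middleware does the work, can be sketched as a dispatcher with pluggable senders. Every name here (`Instruction`, `Orchestrator`, the target and command strings) is hypothetical, and the sender callables stand in for whatever API a given load balancer or database actually exposes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instruction:
    target: str                     # e.g. "load_balancer" or "database"
    command: str                    # e.g. "shift_traffic" or "restore"
    params: Dict[str, object] = field(default_factory=dict)


class Orchestrator:
    """Routes instructions to middleware senders; performs no switching itself."""

    def __init__(self) -> None:
        self._senders: Dict[str, Callable[[Instruction], bool]] = {}

    def register(self, target: str, sender: Callable[[Instruction], bool]) -> None:
        self._senders[target] = sender

    def dispatch(self, instructions: List[Instruction]) -> List[bool]:
        """Send each instruction; False means no sender was registered or it failed."""
        results = []
        for instruction in instructions:
            sender = self._senders.get(instruction.target)
            results.append(sender(instruction) if sender else False)
        return results


sent: List[Instruction] = []


def fake_load_balancer(instruction: Instruction) -> bool:
    # Stand-in for a real middleware call; it only records the instruction.
    sent.append(instruction)
    return True


orchestrator = Orchestrator()
orchestrator.register("load_balancer", fake_load_balancer)
outcome = orchestrator.dispatch([
    Instruction("load_balancer", "shift_traffic", {"percent": 50}),
    Instruction("database", "restore"),
])
```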

In summary, the invention combines the traditional strategy of monitoring, alarming, and post-fault switching and recovery with AI-based learning and prediction of faults. This greatly mitigates the impact of faults on applications, substantially reduces RTO, and lessens the economic impact on customers.

Finally, it should be noted that the above embodiments are merely for illustrating the technical solution of the embodiments of the present invention, and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the invention, and any changes and substitutions that would be apparent to one skilled in the art are intended to be included within the scope of the present invention.

Claims

1. The disaster recovery orchestration based on intelligent AI-driven fault identification and switching is characterized by comprising an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module and a learning optimization module;

2. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 1, wherein the historical data on faults and recovery operations includes logs of public cloud resources, system metrics, user behavior, network activity, events, and external and hardware data of local time nodes.

3. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 2, wherein the external and hardware data includes weather, natural disasters, construction work, optical fibers, the water supply, electricity supply and temperature of the machine room, the temperature of physical machines, and fan rotational speed.

4. The intelligent AI-driven fault identification and switchover based disaster recovery orchestration of claim 1, wherein the features used to predict faults include a sudden increase in CPU utilization, memory leaks, an increase in network delay, natural disasters in the area of the machine room, optical fiber damage, a rise in machine room temperature, and a rise in physical machine temperature.

5. The intelligent AI-driven fault identification and switchover-based disaster recovery orchestration of claim 1, wherein the AI models comprise a supervised learning model and an unsupervised learning model.

6. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 5 wherein the supervised learning model comprises a classifier and the unsupervised learning model comprises anomaly detection.

7. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 1, wherein the identifying and creating features that assist AI models in predicting faults specifically comprises the steps of:

S1: cleaning and preprocessing the collected offline data;

S2: extracting and transforming features from the data according to domain knowledge and experience of the problem;

S3: performing feature selection, that is, selecting the features relevant to the prediction target.

8. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 7, wherein the cleaning and preprocessing includes handling missing values and outliers.

9. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 7, wherein in S3, feature selection is performed by correlation coefficients, analysis of variance, feature importance in model training.

10. The intelligent AI-driven fault identification and switchover-based disaster recovery orchestration of claim 1, wherein the model training and verification module trains AI models using decision tree models.