CN117851199A - Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching

Publication number
CN117851199A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311706787.9A
Other languages
Chinese (zh)
Inventor
华成裕
张亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Application filed by Tianyi Cloud Technology Co Ltd
Priority to CN202311706787.9A
Publication of CN117851199A

Landscapes

  • Testing And Monitoring For Control Systems (AREA)

Abstract

The invention relates to a disaster recovery arrangement based on intelligent AI-driven fault identification and switching, which belongs to the technical field of cloud computing and comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module.

Description

Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching
Technical Field
The invention belongs to the technical field of cloud computing, and in particular relates to a disaster recovery arrangement based on intelligent AI-driven fault identification and switching.
Background
In a typical public cloud (AWS, Microsoft, Google, Alibaba Cloud, etc.) scenario, faults are detected by monitoring the state of resources (for example layer-4 or layer-7 probes, or whether an address is reachable) and of the network. When the state of a resource becomes abnormal, an alarm is sent and automatic or manual switching is performed.
Conventional fault identification, monitoring and switching services monitor the network, access state, logs and the like of resources, and handle a fault only after an error has occurred. This approach leaves the customer's application unavailable for a period of time, because automatic or manual operation can begin only once the fault has happened. The overall switching rules and arrangement are also relatively simple and struggle to cope with increasingly complex service architectures. In addition, manual switching depends on the accumulated skill and judgment of the operation and maintenance personnel, and the switching time is long, which seriously affects service continuity.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a disaster recovery arrangement based on intelligent AI-driven fault identification and switching. By introducing AI learning and prediction capability and combining additional external data, hardware data and software data, it implements fault prediction, fault identification and a switching fault-tolerance mechanism. For public cloud disaster recovery platforms, AI-driven intelligent fault monitoring and analysis provide efficient fault identification and automatic switching, so as to improve the reliability and availability of cloud services, reduce RTO, and improve application availability.
The invention provides a disaster recovery arrangement based on intelligent AI-driven fault identification and switching, which comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module;
the offline data collection module collects historical fault data and recovery operation data for offline training and testing of the AI model;
the feature engineering module identifies and creates features that help the AI model predict faults;
the model training and verification module trains one or more AI models on the features and data to predict faults, and verifies the AI models;
the fault prediction module deploys the trained model into the production environment, monitors the state of the system in real time or periodically, and predicts faults;
the fault identification module identifies faults by combining the AI model's fault prediction with manually configured rules;
the automatic switching and recovery module performs automatic fault switching and recovery according to the fault identification strategy;
the learning optimization module collects and analyzes data from the fault switching and recovery process.
Further, the historical fault data and recovery operation data include logs of public cloud resources, system metrics, user behavior, network activity and events, as well as external and hardware data for the corresponding location and point in time.
Further, the external and hardware data include weather, natural disasters, construction work affecting optical fiber, the water supply, power and temperature of the machine room, the temperature of the physical machines, and fan speed.
Further, the features for predicting faults include a sudden increase in CPU utilization, memory leaks, increased network latency, natural disasters in the area where the machine room is located, optical fiber damage, rising machine-room temperature, and rising physical-machine temperature.
Further, the AI model includes a supervised learning model and an unsupervised learning model.
Further, the supervised learning model includes a classifier and the unsupervised learning model includes anomaly detection.
Further, identifying and creating features that help the AI model predict faults specifically includes the following steps:
S1: cleaning and preprocessing the collected offline data;
S2: extracting and transforming features from the data according to domain knowledge and experience of the problem;
S3: performing feature selection, that is, selecting the features that correlate well with the prediction target.
Further, the cleaning and preprocessing include handling missing values and outliers.
Further, in S3, feature selection is performed using correlation coefficients, analysis of variance, and feature importance from model training.
Further, the model training and verification module trains the AI model using a decision tree model.
The invention has the following beneficial effects:
(1) By combining offline training with real-time learning, the learning capability of the AI model improves the success rate of predicting software service faults and reduces fault duration.
(2) The invention adapts to the resources and manual operations of different users: using the result of each fault identification, the manual operations performed and the related monitoring data, the model learns and predicts faults individually for different users' applications and scenarios, which reduces the misjudgment rate of a general-purpose model.
(3) Through tiered threshold configuration, different automatic or manual actions are taken on the model's prediction result, which keeps the whole system running stably.
(4) The invention incorporates hardware and environmental data, which greatly improves the accuracy and foresight of fault identification. For example, when a fire, earthquake or flood occurs in a given area, the model raises the predicted probability of a fault in the near future even if the application service is currently available, so that faults are discovered as early as possible.
(5) Fault switching and recovery combine automatic and manual modes. On the one hand, faults can be switched and recovered quickly, reducing the time spent waiting for human intervention; on the other hand, manual operation remains possible to a certain extent.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views. It is apparent that the drawings in the following description are only some of the embodiments described in the embodiments of the present invention, and that other drawings may be obtained from these drawings by those of ordinary skill in the art.
Fig. 1 is a schematic diagram of an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the embodiments of the present invention better understood by those skilled in the art, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.
In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
In the description of the present invention, it should be noted that unless explicitly stated and limited otherwise, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The terms "mounted," "connected," "coupled," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of methods and systems that are consistent with aspects of the invention as detailed in the accompanying claims.
The invention provides an intelligent AI-driven disaster recovery arrangement that addresses the following problems of conventional fault identification, monitoring and switching services: they monitor the network, access state, logs and the like of resources and handle a fault only after an error has occurred, which leaves the customer's application unavailable for a period of time; automatic or manual operation can begin only once the fault has happened; and the overall switching rules and arrangement are relatively simple and struggle to cope with increasingly complex service architectures.
To aid understanding of the invention, the terms used herein are explained:
MDRS: Multi-active Disaster Recovery Service;
AI: Artificial Intelligence;
RTO: Recovery Time Objective, the time required for a service to return to an acceptable service level after a system failure occurs.
To explain the present invention, as shown in Fig. 1, the following embodiments are proposed:
Embodiment 1
The disaster recovery arrangement based on intelligent AI-driven fault identification and switching comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module;
the offline data collection module collects historical fault data and recovery operation data for offline training and testing of the AI model;
the feature engineering module identifies and creates features that help the AI model predict faults;
the model training and verification module trains one or more AI models on the features and data to predict faults, and verifies the AI models;
the fault prediction module deploys the trained model into the production environment, monitors the state of the system in real time or periodically, and predicts faults;
the fault identification module identifies faults by combining the AI model's fault prediction with manually configured rules;
the automatic switching and recovery module performs automatic fault switching and recovery according to the fault identification strategy;
the learning optimization module collects and analyzes data from the fault switching and recovery process.
The historical fault data and recovery operation data include logs of public cloud resources, system metrics, user behavior, network activity and events, as well as external and hardware data for the corresponding location and point in time.
The external and hardware data include weather, natural disasters, construction work affecting optical fiber, the water supply, power and temperature of the machine room, the temperature of the physical machines, and fan speed.
Embodiment 2
The disaster recovery arrangement based on intelligent AI-driven fault identification and switching comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module;
the offline data collection module collects historical fault data and recovery operation data for offline training and testing of the AI model;
the feature engineering module identifies and creates features that help the AI model predict faults;
the model training and verification module trains one or more AI models on the features and data to predict faults, and verifies the AI models;
the fault prediction module deploys the trained model into the production environment, monitors the state of the system in real time or periodically, and predicts faults;
the fault identification module identifies faults by combining the AI model's fault prediction with manually configured rules;
the automatic switching and recovery module performs automatic fault switching and recovery according to the fault identification strategy;
the learning optimization module collects and analyzes data from the fault switching and recovery process.
The historical fault data and recovery operation data include logs of public cloud resources, system metrics, user behavior, network activity and events, as well as external and hardware data for the corresponding location and point in time.
The external and hardware data include weather, natural disasters, construction work affecting optical fiber, the water supply, power and temperature of the machine room, the temperature of the physical machines, and fan speed.
The features for predicting faults include a sudden increase in CPU utilization, memory leaks, increased network latency, natural disasters in the area where the machine room is located, optical fiber damage, rising machine-room temperature, and rising physical-machine temperature.
The AI model includes a supervised learning model and an unsupervised learning model.
The supervised learning model includes a classifier and the unsupervised learning model includes anomaly detection.
Embodiment 3
The disaster recovery arrangement based on intelligent AI-driven fault identification and switching comprises an offline data collection module, a feature engineering module, a model training and verification module, a fault prediction module, a fault identification module, an automatic switching and recovery module, and a learning optimization module;
the offline data collection module collects historical fault data and recovery operation data for offline training and testing of the AI model;
the feature engineering module identifies and creates features that help the AI model predict faults;
the model training and verification module trains one or more AI models on the features and data to predict faults, and verifies the AI models;
the fault prediction module deploys the trained model into the production environment, monitors the state of the system in real time or periodically, and predicts faults;
the fault identification module identifies faults by combining the AI model's fault prediction with manually configured rules;
the automatic switching and recovery module performs automatic fault switching and recovery according to the fault identification strategy;
the learning optimization module collects and analyzes data from the fault switching and recovery process.
The historical fault data and recovery operation data include logs of public cloud resources, system metrics, user behavior, network activity and events, as well as external and hardware data for the corresponding location and point in time.
The external and hardware data include weather, natural disasters, construction work affecting optical fiber, the water supply, power and temperature of the machine room, the temperature of the physical machines, and fan speed.
The features for predicting faults include a sudden increase in CPU utilization, memory leaks, increased network latency, natural disasters in the area where the machine room is located, optical fiber damage, rising machine-room temperature, and rising physical-machine temperature.
The AI model includes a supervised learning model and an unsupervised learning model.
The supervised learning model includes a classifier and the unsupervised learning model includes anomaly detection.
Identifying and creating features that help the AI model predict faults specifically includes the following steps:
S1: cleaning and preprocessing the collected offline data;
S2: extracting and transforming features from the data according to domain knowledge and experience of the problem;
S3: performing feature selection, that is, selecting the features that correlate well with the prediction target.
The cleaning and preprocessing include handling missing values and outliers.
In step S3, feature selection is performed using correlation coefficients, analysis of variance, and feature importance from model training.
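As an illustration of the correlation-coefficient part of step S3, the following is a minimal sketch that ranks candidate features by their Pearson correlation with a binary fault label. The feature names, the sample values and the 0.3 cut-off are assumptions made for this example and do not come from the application; analysis of variance and model-based feature importance are not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr=0.3):
    """Return (name, r) pairs with |r| >= min_abs_corr, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    kept = [(name, r) for name, r in scored if abs(r) >= min_abs_corr]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples; a 1 in `fault` marks an observed failure.
columns = {
    "cpu_util": [20, 35, 90, 95, 30, 88],
    "room_temp_c": [22, 23, 31, 33, 22, 30],
    "fan_rpm": [3000, 3000, 3000, 3000, 3000, 3000],
}
fault = [0, 0, 1, 1, 0, 1]
selected = select_features(columns, fault)
```

In this toy data the constant `fan_rpm` column is dropped because it cannot correlate with the label, while the other two columns are kept.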
The model training and verification module trains the AI model using a decision tree model.
In summary, to solve the above problems, the invention optimizes and upgrades existing solutions in three respects.
In a first aspect, faults are identified comprehensively from the AI's fault prediction together with manually configured rules, and automatic fault switching and recovery follow the fault identification strategy. Compared with existing disaster recovery systems, the invention combines data, intelligence, manual control, switching/recovery and continuous learning.
In a second aspect, intelligent and generalized fault identification: AI is trained on fault samples, with support for personalized sample collection and training for different types of resources (databases, application software services). After training, the model predicts faults so that a fault can be identified before or at the moment it occurs and automatic fault switching can be performed. This firstly removes the strong dependence on personnel skills, and secondly improves switching efficiency and reduces fault time.
In a third aspect, extended monitoring dimensions:
(1) Hardware monitoring: sensors monitor the hardware environment of the machine room at each site, including temperature, humidity, water and power.
(2) Software monitoring: the software environment deployed by the customer is monitored, including the operating system, application processes, ports, the state of the whole cluster, the network, and the state of the various middleware.
(3) Cloud resource monitoring: the various cloud resources provided by public cloud vendors are brought under management and monitored.
(4) Self-owned resource monitoring: the user's own resources are probed and monitored by means of an agent.
(5) Environmental monitoring: external factors such as natural disasters, weather and construction are monitored for the machine rooms where the customer is deployed, whether in the same city or in different locations.
(6) This multi-dimensional collection of monitoring data keeps the system's monitoring real-time, accurate and multi-dimensional; combined with AI model training, faults can be predicted and identified earlier and more accurately.
To further explain the present invention, another embodiment is provided:
Embodiment 4
The invention comprises the following functional modules:
Offline data collection: historical fault data and recovery operation data are collected. This should include logs of public cloud resources, system metrics, user behavior, network activity, events and the like. It also includes external/hardware data for the corresponding location and point in time (weather, natural disasters, construction work affecting optical fiber, the water supply, power and temperature of the machine room, the temperature of the physical machines, fan speed). All of the above data can be used to train and test AI models offline.
Feature engineering: at this stage, we need to identify and create features that can help the AI model predict faults. For example, a sudden increase in CPU utilization, memory leaks, increased network latency, natural disasters in the area of the machine room, optical fiber damage, rising machine-room temperature and rising physical-machine temperature may all be predictive of a fault.
Model training and verification: using the collected data and features, one or more AI models may be trained to predict faults. This may include supervised learning models (e.g., classifiers) and unsupervised learning models (e.g., anomaly detection). After training the model, we need to test its performance using a validation dataset.
Fault prediction: the trained models are deployed into the production environment to monitor the state of the system and predict possible faults in real time or periodically.
Fault identification: faults are identified comprehensively from the AI's fault prediction together with manually configured rules. If a manually configured rule is hit, the event is treated as a fault with 100% certainty, and fault switching and recovery are performed.
For the AI prediction result, a tiered score mechanism is applied; for ease of operation, the system defines three score tiers.
For example, if the score is greater than 90 (above 90 the model prediction is considered highly reliable and subsequent operations can run automatically), full automatic fault switching and recovery are performed and an alarm is raised.
For example, if the score is greater than 80 (between 80 and 90 the model is considered fairly reliable; full automatic operation is not performed and half of the operation is performed first), 50% of the traffic is automatically switched and recovered and an alarm is raised.
For example, if the score is greater than 70 (between 70 and 80 the score serves as a reference value to alert the user, who decides the subsequent operation manually), an alarm is raised and fault switching and recovery wait for manual operation. The thresholds can differ by scenario or resource.
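The three score tiers can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Action` type, its field names and the function name `decide` are hypothetical, and the default thresholds (90, 80, 70) are the example values given in the text, exposed as parameters because the thresholds may differ by scenario or resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    traffic_switch_percent: int  # share of traffic moved to the backup site
    alert: bool                  # whether an alarm is raised
    wait_for_operator: bool      # whether switching waits for manual operation


def decide(score, full=90, half=80, notify=70):
    """Map an AI prediction score (0-100) to an action tier."""
    if score > full:
        return Action(100, True, False)  # highly reliable: full automatic switch
    if score > half:
        return Action(50, True, False)   # fairly reliable: switch half the traffic
    if score > notify:
        return Action(0, True, True)     # reference value: alert, operator decides
    return Action(0, False, False)       # below all tiers: no action
```

For instance, `decide(85)` yields a 50% traffic switch with an alarm, whereas `decide(75)` only raises an alarm and waits for the operator.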
Automated switching and recovery: automated failover and recovery are performed according to the above fault identification strategy. This includes redirecting traffic to a backup region, restarting failed services, or restoring the state of the system from backup data.
Continuous learning and optimization: the system should be able to collect and analyze data from the fault switching and recovery process in order to continually improve the AI model and the switching strategy.
If a fault is not successfully predicted, or a switching strategy does not work as expected, a person reports the event as a fault. The system then automatically records the condition data (external data, system data, software data, hardware data, etc.) from around that point in time.
The related data is fed back to the algorithm model labeling personnel, who confirm the data and the result. If the event is confirmed as a real fault, the data is used to retrain the model as part of a model upgrade. If the report is false, the data is discarded so that it does not affect the accuracy of the model.
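The feedback step described above can be sketched as follows, assuming a simple in-memory queue of retraining samples. The class names, the `snapshot` dictionary layout and the `label` key are hypothetical and chosen only for illustration; the application does not specify how the data is stored.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackRecord:
    snapshot: dict         # condition data recorded around the reported time
    confirmed_fault: bool  # verdict of the labeling personnel


@dataclass
class RetrainingQueue:
    samples: list = field(default_factory=list)

    def review(self, record):
        """Keep confirmed faults as labeled samples; discard false reports."""
        if record.confirmed_fault:
            self.samples.append({**record.snapshot, "label": 1})
            return True
        return False  # discarded so it cannot degrade model accuracy


queue = RetrainingQueue()
kept = queue.review(FeedbackRecord({"room_temp_c": 41, "cpu_util": 97}, True))
dropped = queue.review(FeedbackRecord({"room_temp_c": 22, "cpu_util": 15}, False))
```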
In addition, regarding the specific steps of AI model training, the following embodiment is provided:
Embodiment 5
First, the fault prediction model is trained on offline data that has been collected and cleaned;
meanwhile, the model is trained and optimized in real time using the process data and result data generated by fault identification in the production environment, together with the manual operation data of end users, so as to maintain the model's prediction quality;
Feature engineering follows this flow:
First, the collected offline data is cleaned and preprocessed, including handling missing values, outliers and the like.
Next, features are extracted and transformed from the data based on domain knowledge and experience of the problem. Statistical features, time-series features, frequency-domain features and the like may be used to describe the time and frequency characteristics of the data. Features such as sliding-window and lag statistics can be constructed from the data's history to capture its trend and volatility.
Finally, features that correlate well with the prediction target are selected, using methods such as correlation coefficients, analysis of variance, and feature importance from model training.
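The sliding-window and lag statistics mentioned in this flow might look like the sketch below, which derives a rolling mean, a rolling standard deviation and a lagged difference from a single metric series. The window size, the lag and the sample values are assumptions for illustration only.

```python
from statistics import mean, pstdev


def window_features(series, window=3, lag=1):
    """Build one feature row per point that has enough history."""
    rows = []
    start = max(window - 1, lag)  # first index with a full window and a lag value
    for i in range(start, len(series)):
        recent = series[i - window + 1 : i + 1]
        rows.append({
            "value": series[i],
            "rolling_mean": mean(recent),             # trend over the window
            "rolling_std": pstdev(recent),            # volatility over the window
            "lag_diff": series[i] - series[i - lag],  # change versus `lag` steps ago
        })
    return rows


# Hypothetical CPU utilization samples with a sudden increase at the fourth point.
rows = window_features([10, 12, 11, 40, 42])
```

A sudden increase shows up as a large `lag_diff` and a jump in `rolling_std`, which is the kind of signal the text describes as capturing trend and volatility.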
The algorithm model uses a decision tree algorithm; the specific algorithm and implementation process are as follows:
Data collection and preparation:
Relevant offline data is collected, including the various features and related information at the time of each fault.
The data is cleaned and preprocessed, handling missing values, outliers and the like.
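One possible cleaning step, shown as a sketch rather than as the method used by the invention: missing readings are filled with the median, and outliers are clipped to the common 1.5 x IQR fences. Both choices, as well as the sample temperatures, are assumptions made for this example.

```python
from statistics import median, quantiles


def clean(values):
    """Fill None with the median, then clip values to the 1.5 * IQR fences."""
    present = [v for v in values if v is not None]
    fill = median(present)
    q1, _, q3 = quantiles(present, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    filled = [fill if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]


# Hypothetical machine-room temperatures with one gap and one sensor spike.
cleaned = clean([21.0, 22.0, None, 23.0, 22.5, 95.0, 21.5])
```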
Feature engineering:
Features are extracted and transformed from the data according to the background and requirements of the problem.
Statistical features, time-series features, frequency-domain features and the like may be used to describe the time and frequency characteristics of the data.
Features such as sliding-window and lag statistics are constructed from the data's history to capture its trend and volatility.
Data set partitioning:
The processed data set is divided into a training set and a test set, typically using 70% of the data as the training set and 30% as the test set.
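A 70/30 partition can be sketched with the standard library alone. The function name, the integer `train_percent` parameter and the fixed seed are assumptions for this example; the application does not state whether the data is shuffled before splitting, and for time-ordered data a chronological split may be more appropriate.

```python
import random


def split_dataset(rows, train_percent=70, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(10))
```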
Training a decision tree model:
The C4.5 decision tree algorithm is selected for model training.
During training, the decision tree automatically selects the best feature for each node split, so as to maximize the prediction accuracy of the model.
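C4.5 chooses the split feature by gain ratio, that is, information gain divided by the split information of the feature. The sketch below computes that criterion for discrete features only; it is not a full C4.5 implementation (no threshold search for continuous attributes, no pruning, no tree construction), and the feature names and samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """C4.5 split criterion: information gain divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def best_feature(table, labels):
    """Name of the feature with the highest gain ratio."""
    return max(table, key=lambda name: gain_ratio(table[name], labels))


# Hypothetical discretized samples; label 1 marks a fault.
table = {
    "temp_high": [1, 1, 0, 0, 1, 0],
    "weekday": ["mon", "tue", "mon", "tue", "mon", "tue"],
}
labels = [1, 1, 0, 0, 1, 0]
root = best_feature(table, labels)
```

Here `temp_high` separates the labels perfectly, so it would be chosen for the root split ahead of `weekday`.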
Model evaluation and tuning:
The trained model is evaluated on the test set, computing metrics such as accuracy, recall and F1 score.
The model's parameters, feature selection and the like can be adjusted according to the evaluation result to further optimize its performance.
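The evaluation metrics named above follow directly from the confusion-matrix counts. The sketch below treats label 1 as "fault" and also reports precision, which the F1 score requires; the label vectors are made up for illustration.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fault)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test-set labels and model predictions.
metrics = evaluate([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

For fault prediction, recall is usually the metric to watch, since a missed fault (false negative) means the switch happens only after the outage has begun.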
In addition, regarding AI fault identification and switching, the following embodiment is provided:
Embodiment 6
Real-time resource data monitoring: the state of the resources in the user's resource pool is collected through real-time heartbeats, including server state, network state, application port state, database state and the like.
Real-time external/hardware data monitoring: external interfaces and sensors are integrated to collect, in real time, data for the region or machine room corresponding to the user, including weather, natural disaster level, machine-room temperature, humidity and electrical load, server temperature and the like.
In addition, regarding the detailed steps of fault identification, an embodiment is provided in the present invention:
embodiment 7
Fault identification: after real-time data acquisition is completed, rule judgment and AI model prediction are each performed, and a fault identification probability is output.
The detailed flow is as follows:
(1) If a manually configured rule is hit, the event is treated as a fault with 100% certainty, and fault switching and recovery are then performed. Manual rule configuration supports every collected metric, so the user can configure rules on, for example, machine room temperature, whether a geological disaster has occurred, network bandwidth usage, CPU, memory, and load.
(2) For the AI prediction result, a tiered score mechanism is applied; for ease of operation, the system defines three score tiers.
For example, if the score is greater than 90 (above 90 the model prediction is considered highly reliable, and subsequent operations can be performed automatically), full automatic fault switching and recovery are performed, and an alarm is raised at the same time.
For example, if the score is greater than 80 (between 80 and 90 the model is considered fairly reliable; instead of fully automatic operation, half of the operation is performed first), 50% of the traffic is automatically switched and recovered, and an alarm is raised.
For example, if the score is greater than 70 (between 70 and 80 the score serves as a reference value to remind the user, who manually decides the subsequent operation), an alarm is raised and the system waits for fault switching and recovery to be performed manually. The thresholds may differ between scenarios or resources.
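The rule check and the three score tiers can be expressed as a small decision function. This is a sketch under stated assumptions: the names `Decision`, `decide`, and `DEFAULT_TIERS` are hypothetical, and raising an alarm on a rule hit is an assumption, because the text mentions alarms only for the AI score tiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    action: str           # "auto_switch", "manual", or "none"
    traffic_percent: int  # share of traffic switched automatically
    alert: bool

# Tier boundaries from the text; they may vary by scenario or resource.
DEFAULT_TIERS = {"full": 90, "half": 80, "notify": 70}

def decide(rule_hit, ai_score, tiers=DEFAULT_TIERS):
    """Map a rule result and an AI score (0-100) to a switching decision."""
    if rule_hit:
        # A manual rule hit counts as a certain fault. The text mentions no
        # alarm for this branch, so alert=True is an assumption.
        return Decision("auto_switch", 100, True)
    if ai_score > tiers["full"]:
        return Decision("auto_switch", 100, True)   # full automatic switch
    if ai_score > tiers["half"]:
        return Decision("auto_switch", 50, True)    # switch half of the traffic
    if ai_score > tiers["notify"]:
        return Decision("manual", 0, True)          # alarm, wait for the operator
    return Decision("none", 0, False)
```

Passing a different `tiers` mapping per resource type is one way to support thresholds that vary by scenario.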
In addition, regarding failover, an embodiment is provided in the present invention:
embodiment 7
(1) If a rule is hit, or the prediction result meets the specified AI threshold, automatic fault switching is performed immediately.
(2) If the AI score is below the threshold but a risk may still exist, an alarm is raised so that fault switching and recovery can be initiated manually.
(3) The system supports switching and recovery in multiple scenarios, specifically:
(a) Switching: sending of the corresponding automatic switching instructions, such as for load balancing and traffic, is supported;
(b) Recovery: sending of instructions for same-city and remote data recovery is supported.
The actual switching and data recovery are carried out by the corresponding middleware (load balancer, MySQL database, and the like); the system only sends the instructions to be executed.
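Because the system only sends instructions and leaves execution to the middleware, the dispatch step can be kept very thin. The sketch below is illustrative only: the instruction fields, the target names, the site name, and the JSON encoding are all hypothetical and do not correspond to any real load balancer or MySQL API.

```python
import json

def build_instructions(traffic_percent, standby_site, recovery_scope):
    """Return the instruction messages for one failover (hypothetical fields)."""
    if recovery_scope not in ("same_city", "remote"):
        raise ValueError("recovery_scope must be 'same_city' or 'remote'")
    return [
        {"target": "load_balancer", "op": "shift_traffic",
         "to": standby_site, "percent": traffic_percent},
        {"target": "mysql", "op": "recover_data", "scope": recovery_scope},
    ]

def dispatch(instructions, send):
    """Hand each encoded instruction to `send`; the middleware does the actual work."""
    return [send(json.dumps(instr, sort_keys=True)) for instr in instructions]

# Stand-in transport: collect the messages in a list instead of sending them.
sent = []
dispatch(build_instructions(50, "site-b", "same_city"), sent.append)
```

Injecting the `send` callable keeps the orchestration logic independent of the transport, so the same code could drive a message queue, an HTTP client, or a test double.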
In summary, the invention combines the traditional strategy of monitoring, alarming, and post-fault switching and recovery with AI-based learning and prediction of faults. This can greatly reduce the impact of faults on applications, greatly shorten the RTO (recovery time objective), and reduce the economic impact on customers.
Finally, it should be noted that the above embodiments are merely for illustrating the technical solution of the embodiments of the present invention, and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the invention, and any changes and substitutions that would be apparent to one skilled in the art are intended to be included within the scope of the present invention.

Claims (10)

1. The disaster recovery arrangement based on intelligent AI-driven fault recognition and switching is characterized by comprising an offline data collection module, a construction feature engineering module, a model training and verification module, a fault prediction module, a fault recognition module, an automatic switching and recovery module and a learning optimization module;
the off-line data collection module is used for collecting historical data on faults and recovery operations, for off-line training and testing of an AI model;
the construction feature engineering module is used for identifying and creating features for helping the AI model to predict faults;
the model training and verifying module is used for predicting faults through characteristic and data training of one or more AI models and verifying the AI models;
the fault prediction module is used for deploying the trained model into a production environment, monitoring the state of the system in real time or periodically and predicting faults;
the fault identification module is used for carrying out fault identification judgment through the prediction of an AI model on the fault and the rule configuration carried out manually;
the automatic switching and recovering module is used for performing automatic fault switching and recovering through a fault identification strategy;
the learning optimization module is used for collecting and analyzing data of the fault switching and recovery process.
2. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 1, wherein the historical data on faults and recovery operations includes logs of public cloud resources, system metrics, user behavior, network activity, events, and external and hardware data of local time nodes.
3. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 2, wherein the external and hardware data includes weather, natural disasters, fiber damage caused by construction, water supply, power supply, temperature of the machine room, temperature of the physical machine, and rotational speed of fans.
4. The intelligent AI-driven fault identification and switchover based disaster recovery orchestration of claim 1, wherein the features for predicting faults include sudden increases in CPU utilization, memory leaks, increases in network delay, natural disasters in the area of the machine room, fiber damage, temperature increases in the machine room, and temperature increases in the physical machines.
5. The intelligent AI-driven fault identification and switchover-based disaster recovery orchestration of claim 1, wherein the AI models comprise a supervised learning model and an unsupervised learning model.
6. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 5 wherein the supervised learning model comprises a classifier and the unsupervised learning model comprises anomaly detection.
7. The intelligent AI-driven fault-recognition and switchover-based disaster-tolerant orchestration of claim 1, wherein the identifying and creating features that assist AI models in predicting faults specifically comprises the steps of:
S1: cleaning and preprocessing the collected offline data;
S2: extracting and transforming features of the data according to domain knowledge and experience of the problem;
S3: performing feature selection, namely selecting the features relevant to the prediction target.
8. The intelligent AI-driven fault-recognition and switchover-based disaster-tolerant orchestration of claim 7, wherein the cleaning and preprocessing includes handling missing values, outliers.
9. The intelligent AI-driven fault-recognition and switchover-based disaster recovery orchestration of claim 7, wherein in S3, feature selection is performed by correlation coefficients, analysis of variance, feature importance in model training.
10. The intelligent AI-driven fault identification and switchover-based disaster recovery orchestration of claim 1, wherein the model training and verification module trains AI models using decision tree models.
CN202311706787.9A 2023-12-13 2023-12-13 Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching Pending CN117851199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311706787.9A CN117851199A (en) 2023-12-13 2023-12-13 Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311706787.9A CN117851199A (en) 2023-12-13 2023-12-13 Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching

Publications (1)

Publication Number Publication Date
CN117851199A true CN117851199A (en) 2024-04-09

Family

ID=90541000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311706787.9A Pending CN117851199A (en) 2023-12-13 2023-12-13 Disaster recovery arrangement based on intelligent AI-driven fault recognition and switching

Country Status (1)

Country Link
CN (1) CN117851199A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination