CN118051374A - Intelligent check point recovery method, cloud operating system and computing platform - Google Patents

Info

Publication number
CN118051374A
Authority
CN
China
Prior art keywords
data
detection point
operating system
historical
intelligent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410431888.8A
Other languages
Chinese (zh)
Other versions
CN118051374B (en)
Inventor
邓练兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Qinzhi Technology Research Institute Co ltd
Original Assignee
Guangdong Qinzhi Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Qinzhi Technology Research Institute Co ltd
Priority to CN202410431888.8A
Publication of CN118051374A
Application granted
Publication of CN118051374B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1464 Management of the backup or restore process for networked environments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G06F11/3062 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations where the monitored property is the power consumption
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application belongs to the field of data processing, and particularly relates to an intelligent check point recovery method, a cloud operating system and a computing platform, wherein the method comprises the following steps: acquiring historical system monitoring data of an intelligent computing cloud operating system; performing predictive analysis on historical system monitoring data through a detection point predictive model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result; and carrying out real-time monitoring on the current system state of the intelligent computing cloud operating system through the real-time monitoring model, triggering a corresponding target detection point based on a real-time monitoring result, and carrying out recovery operation on the intelligent computing cloud operating system. According to the method, by combining historical system monitoring data, a detection point prediction model and a real-time monitoring model, intelligent system recovery operation is achieved, stability, reliability and data integrity of the system are improved, and the recovered intelligent computing cloud operating system is ensured to be in a normal running state and data integrity is maintained.

Description

Intelligent check point recovery method, cloud operating system and computing platform
Technical Field
The application belongs to the field of data processing, and particularly relates to an intelligent check point recovery method, a cloud operating system and a computing platform.
Background
To promote the adoption of intelligent applications across industries and fields, there is an urgent need to build intelligent computing platforms and supporting intelligent supercomputing centers. These provide artificial intelligence platform infrastructure for scientific research, industry and urban services, and in turn support talent aggregation and industrial upgrading and development. Application containerization is a technique for packaging an application and all of its dependencies into a separate, portable container. Containerization allows applications, libraries, configuration files, and other dependencies to be bundled together so that they run consistently across environments, which improves deployment efficiency, portability, and flexibility and lets developers manage and deploy applications more easily.
In the related art, current computing systems often consist of a large number of components and services that are dependent upon and interact with each other. Meanwhile, the workload of the system may also vary with time, user demand, and the like. In such complex and dynamic environments, conventional fault recovery methods may not be effective for a variety of situations, and thus intelligent checkpoint recovery methods are needed to achieve more flexible and intelligent recovery operations.
Disclosure of Invention
The application provides an intelligent check point recovery method, a cloud operating system and a computing platform, which are used for effectively improving the performance and resource utilization rate of the intelligent computing cloud operating system, meeting the execution requirements of different types of services and improving the performance, flexibility and expansibility of the system.
In a first aspect, the present application provides an intelligent checkpoint recovery method, applied to an intelligent computing cloud operating system, where the intelligent computing cloud operating system is an operating system adapted to a cloud computing environment; the intelligent check point recovery method comprises the following steps:
Acquiring historical system monitoring data of the intelligent computing cloud operating system; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency;
Performing predictive analysis on the historical system monitoring data through a detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result; the detection point prediction model is used for comprehensively constructing a target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area; the configuration information of the target detection point after the establishment is dynamically adjusted along with the change of the running state of the real-time task in the intelligent computing cloud operating system;
Carrying out real-time monitoring on the current system state of the intelligent computing cloud operating system through a real-time monitoring model, triggering a corresponding target detection point based on a real-time monitoring result, and carrying out recovery operation on the intelligent computing cloud operating system.
In a second aspect, an embodiment of the present application provides an intelligent computing cloud operating system, where the intelligent computing cloud operating system is an operating system adapted to a cloud computing environment; the intelligent computing cloud operating system includes:
a monitoring unit configured to obtain historical system monitoring data of the intelligent computing cloud operating system; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency;
The creating unit is configured to carry out predictive analysis on the historical system monitoring data through a detection point prediction model and create a target detection point matched with the intelligent computing cloud operating system based on an analysis result; the detection point prediction model is used for comprehensively constructing a target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area; the configuration information of the target detection point after the establishment is dynamically adjusted along with the change of the running state of the real-time task in the intelligent computing cloud operating system;
The recovery unit is configured to monitor the current system state of the intelligent computing cloud operating system in real time through a real-time monitoring model, trigger a corresponding target detection point based on a real-time monitoring result, and perform recovery operation on the intelligent computing cloud operating system.
In a third aspect, an embodiment of the present application provides an intelligent computing platform, including:
at least one processor, a memory, and an input/output unit;
wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program stored in the memory to perform the intelligent checkpoint recovery method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided that includes instructions that, when executed on a computer, cause the computer to perform the intelligent checkpoint recovery method of the first aspect.
According to the technical scheme provided by the embodiment of the application, the intelligent computing cloud operating system is an operating system which is adapted to a cloud computing environment. In the intelligent check point recovery scheme, firstly, historical system monitoring data of an intelligent computing cloud operating system are obtained; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency. And further, performing predictive analysis on the historical system monitoring data through a detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result. The detection point prediction model is used for comprehensively constructing a target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area; the configuration information of the target detection point after the creation is completed is dynamically adjusted along with the change of the running state of the real-time task in the intelligent computing cloud operating system. Finally, the current system state of the intelligent computing cloud operating system is monitored in real time through the real-time monitoring model, and corresponding target detection points are triggered based on the real-time monitoring results to recover the intelligent computing cloud operating system. 
According to the technical scheme provided by the embodiment of the application, the intelligent system recovery operation is realized by combining the historical system monitoring data, the detection point prediction model and the real-time monitoring model, the stability, the reliability and the data integrity of the system are improved, the recovered intelligent computing cloud operating system is ensured to be in a normal running state, and the data integrity is maintained.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an intelligent checkpoint recovery method in accordance with an embodiment of the present application;
FIG. 2 is a schematic diagram of an intelligent computing cloud operating system according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
To promote the adoption of intelligent applications across industries and fields, there is an urgent need to build intelligent computing platforms and supporting intelligent supercomputing centers. These provide artificial intelligence platform infrastructure for scientific research, industry and urban services, and in turn support talent aggregation and industrial upgrading and development. Application containerization is a technique for packaging an application and all of its dependencies into a separate, portable container. Containerization allows applications, libraries, configuration files, and other dependencies to be bundled together so that they run consistently across environments, which improves deployment efficiency, portability, and flexibility and lets developers manage and deploy applications more easily.
Cloud computing is an emerging computing model that provides on-demand computing resources and services over a network. Its core idea is to distribute computing tasks across a resource pool made up of a large number of computers, so that applications can obtain computing power, storage space, and software services as needed. Intelligent computing is a technology that simulates human intelligence: by imitating human ways of thinking and learning, it enables a computer to complete complex tasks automatically. Resource management techniques concern how to allocate and schedule system resources efficiently to meet user demands.
In the related art, current computing systems often consist of a large number of components and services that are dependent upon and interact with each other. Meanwhile, the workload of the system may also vary with time, user demand, and the like. In such complex and dynamic environments, conventional fault recovery methods may not be effective for a variety of situations, and thus intelligent checkpoint recovery methods are needed to achieve more flexible and intelligent recovery operations.
The embodiment of the application provides an intelligent check point recovery method, a cloud operating system and a computing platform.
In particular, the intelligent checkpoint recovery scheme can be applied to an intelligent computing cloud operating system. The intelligent computing cloud operating system is an operating system adapted to a cloud computing environment. In the intelligent check point recovery scheme, firstly, historical system monitoring data of an intelligent computing cloud operating system are obtained; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency. And further, performing predictive analysis on the historical system monitoring data through a detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result. Finally, the current system state of the intelligent computing cloud operating system is monitored in real time through the real-time monitoring model, and corresponding target detection points are triggered based on the real-time monitoring results to recover the intelligent computing cloud operating system.
In the intelligent check point recovery scheme, by means of historical system monitoring data and a check point prediction model, possible faults of the system can be predicted and analyzed, potential risk points can be recognized in advance, corresponding preventive measures are taken, the occurrence frequency of the faults of the system is reduced, and the stability and reliability of the system are improved. The real-time monitoring model can monitor the current state of the intelligent computing cloud operating system in real time, and trigger corresponding target detection points in time when abnormality or risk is found. Therefore, the quick response to the faults can be realized, recovery operation is timely adopted, the fault recovery time of the system is shortened, and the influence on the user is reduced. The target detection point created based on the analysis result can guide the recovery operation more accurately, so that the recovery process is more efficient and reliable. By combining the historical monitoring data and the real-time monitoring result, the recovery operation can be ensured to have strong pertinence, and the problems faced by the current system can be effectively solved. In the system recovery process, it is critical to ensure the integrity of the data. The system state is monitored in real time through the real-time monitoring model, the integrity of the data is ensured in the recovery operation, the risks of data loss and data inconsistency can be reduced to the greatest extent, and the safety and reliability of the user data are ensured.
In summary, by combining the historical system monitoring data, the detection point prediction model and the real-time monitoring model, the intelligent system recovery operation can be realized, the stability, the reliability and the data integrity of the system are improved, and better service experience is provided for users.
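The overall flow described above (create a checkpoint, monitor the live state, trigger recovery on an anomaly) can be illustrated with a minimal sketch. It is not the patented implementation: the names `Checkpoint` and `RecoveryLoop` are hypothetical, and a simple load threshold stands in for the real-time monitoring model, whose internals the description does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    """Snapshot of system state at a logical time (hypothetical structure)."""
    taken_at: int
    state: Dict[str, int]


@dataclass
class RecoveryLoop:
    # Assumption: a load above this threshold counts as an anomaly.
    load_threshold: float
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def create_checkpoint(self, now: int, state: Dict[str, int]) -> None:
        # Copy the state so later mutations do not alter the snapshot.
        self.checkpoints.append(Checkpoint(now, dict(state)))

    def monitor_and_recover(self, load: float, state: Dict[str, int]) -> bool:
        """Roll `state` back to the latest checkpoint if the load is abnormal.

        Returns True when a recovery was triggered.
        """
        if load <= self.load_threshold or not self.checkpoints:
            return False
        latest = self.checkpoints[-1]
        state.clear()
        state.update(latest.state)
        return True


# Usage: take a checkpoint, let the state drift, then observe an anomaly.
loop = RecoveryLoop(load_threshold=0.9)
state = {"jobs_done": 3}
loop.create_checkpoint(now=10, state=state)
state["jobs_done"] = 7
recovered = loop.monitor_and_recover(load=0.97, state=state)
```

After the abnormal reading, `state` holds the values captured by the checkpoint; a normal reading leaves it untouched.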
The intelligent check point recovery scheme provided by the embodiment of the application can be executed by a chip. The chip described herein may be any of various special-purpose processors, including a graphics processing unit (Graphics Processing Unit, GPU), a machine learning unit (Machine Learning Unit, MLU), a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Further optionally, the artificial intelligence chip and accelerator card design may employ high-performance MLUs as the base module of the intelligent platform. The high-performance, low-power MLU artificial intelligence processor card adopts the latest architecture; its equivalent theoretical peak speed can reach 128 trillion fixed-point operations per second, its typical board-level power consumption is only 80 watts, and its peak power consumption does not exceed 110 watts. A high-performance artificial intelligence server can be built modularly on the MLU to flexibly handle different intelligent application loads.
The intelligent check point recovery scheme provided by the embodiment of the application can also be executed by electronic equipment, and the electronic equipment can be a server, a server cluster and a cloud server. The electronic device may also be a terminal device such as a cell phone, computer, tablet, wearable device, or a dedicated device (e.g., a dedicated terminal device with an intelligent checkpoint recovery system, etc.). The chips described in the above embodiments may be mounted on these electronic devices. Or the electronic devices may also install a service program for performing the intelligent checkpoint recovery scheme.
In the embodiment of the application, the intelligent computing cloud operating system is mainly responsible for storing various related data such as input data, computing results, observation data, visual data and the like of the advanced computing platform. The data may be from different applications and require unified management and storage for subsequent analysis and processing.
FIG. 1 is a schematic diagram of an intelligent checkpoint recovery method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
101, acquiring historical system monitoring data of the intelligent computing cloud operating system.
In an embodiment of the present application, the historical system monitoring data at least includes: historical system running state, historical system load condition and historical fault occurrence frequency. In this embodiment, the historical system operating states include the system's operating time, operating state (e.g., normal, abnormal, shutdown, etc.), operating mode (e.g., working mode, backup mode, etc.). Such data may be obtained through system logs, monitoring software, or other means of running state logging. Knowing the historical system operating state can help analyze the stability and reliability of the system to find possible anomalies or problems. The historical system load condition relates to the use condition of system resources, such as the use rate of an artificial intelligent chip, the resource occupancy rate of a data processing task, the use rate of a CPU, the use rate of a memory, the I/O of a disk, the network traffic and the like. These data can help to evaluate the performance status and resource utilization of the system, identify system load peaks and resource bottlenecks, thereby optimizing system configuration and improving system performance. Historical failure frequency refers to the frequency and type of failure that the system has developed over a period of time. Such data typically includes fault type, time of occurrence, duration of fault, etc. By analyzing the historical fault data, the stability and reliability condition of the system can be known, the occurrence rule and cause of the fault can be identified, corresponding preventive measures can be taken, and the occurrence frequency of the fault can be reduced.
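As a rough illustration of how such historical monitoring data might be represented and summarized, the sketch below assumes a flat record per sample. The field names (`run_state`, `cpu_usage`, `fault_type`, and so on) and both helper functions are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringSample:
    """One historical monitoring record (hypothetical field names)."""
    timestamp: float      # seconds since an arbitrary epoch
    run_state: str        # e.g. "normal", "abnormal", "shutdown"
    cpu_usage: float      # fraction in [0, 1]
    memory_usage: float   # fraction in [0, 1]
    fault_type: str = ""  # empty string when no fault was recorded


def failure_frequency(samples: List[MonitoringSample]) -> float:
    """Faults per hour over the time span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    times = [s.timestamp for s in samples]
    span_hours = (max(times) - min(times)) / 3600.0
    if span_hours <= 0:
        return 0.0
    faults = sum(1 for s in samples if s.fault_type)
    return faults / span_hours


def peak_cpu(samples: List[MonitoringSample]) -> float:
    """Highest CPU usage seen in the history, a simple load-peak indicator."""
    return max((s.cpu_usage for s in samples), default=0.0)


# Usage: four samples spanning two hours, two of which record a fault.
history = [
    MonitoringSample(0.0, "normal", 0.40, 0.50),
    MonitoringSample(1800.0, "abnormal", 0.95, 0.80, fault_type="oom"),
    MonitoringSample(3600.0, "normal", 0.55, 0.60),
    MonitoringSample(7200.0, "abnormal", 0.70, 0.90, fault_type="disk"),
]
rate = failure_frequency(history)
peak = peak_cpu(history)
```

Summaries like these are the kind of input a prediction model could consume; the application itself does not prescribe a particular schema.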
By collecting, analyzing and utilizing the historical system monitoring data, an administrator can be helped to better know the running condition, the resource utilization condition and the fault condition of the system, so that the system management and operation and maintenance strategy is optimized, and the stability, the reliability and the performance of the system are improved.
102, performing predictive analysis on the historical system monitoring data through a detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result.
In the embodiment of the application, the detection point prediction model is used for comprehensively constructing the target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area.
In an intelligent computing cloud operating system, a "checkpoint" (the term rendered elsewhere in this description as "detection point") refers to a snapshot of the system state at a certain point in time during operation, including the various state information and necessary data of the system. Creating a checkpoint typically means saving a snapshot of the current system state to persistent storage so that it can be used for system recovery or fault handling when needed. In the checkpoint prediction model, the results of predictive analysis of historical system monitoring data can be used to determine when and under what conditions new checkpoints should be created, in order to achieve better system stability and availability. This predictive analysis may include identifying potential failure modes, peak load periods, and the like.
Thus, a target checkpoint created by a checkpoint predictive model means that a new checkpoint is automatically created at an expected point in time or system state based on the predictive analysis results. The creation of this target checkpoint is based on the output of a predictive model so that when the system fails or needs to be restored, it can quickly roll back to the system state to which this checkpoint corresponds to reduce data loss and the time required for system restoration.
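Saving a snapshot to persistent storage and rolling back to it can be sketched as follows. This is only an illustration under stated assumptions: the state is taken to be JSON-serializable, the function names `save_checkpoint` and `restore_checkpoint` are hypothetical, and a write-then-rename step is used so that a crash during the write does not leave a half-written checkpoint under the final name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, name: str, state: dict) -> Path:
    """Persist `state` as JSON, writing to a temp file and renaming it."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{name}.json"
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, target)   # replaces the target in a single step
    return target


def restore_checkpoint(path: Path) -> dict:
    """Load the saved state so the caller can roll back to it."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Usage: write a snapshot into a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as workdir:
    saved = save_checkpoint(Path(workdir), "ckpt-0001",
                            {"step": 42, "queue": ["a", "b"]})
    restored = restore_checkpoint(saved)
```

A real system would snapshot far richer state (process memory, container state, task queues); the JSON dictionary here only stands in for that.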
The checkpoints of the present application improve on the prior art in the following respects:
First, intelligent creation. The application may use advanced predictive analysis techniques, performing predictive analysis on historical system monitoring data to determine the best timing and conditions for creating checkpoints, so as to maximize system stability and availability.
Second, adaptivity. The checkpoint creation strategy can be adjusted in real time according to the running state of the system and changes in its environment, so that it adapts better to different operating conditions and requirements.
Third, accuracy and efficiency. With a more accurate predictive analysis model, the likelihood of system failures or peak periods of resource load may be predicted more precisely, making the created checkpoints more targeted and efficient.
Fourth, data-driven decision making. The checkpoint creation strategy can be adjusted dynamically according to real-time system monitoring data and predictive analysis results, achieving effective management and protection of the system state.
Fifth, integrated management. Checkpoints can be integrated with other functions of the intelligent computing cloud operating system, such as automated operation and maintenance and self-healing capability, realizing comprehensive intelligent management and optimization of the system.
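The application does not state how the "best timing" is computed. As one hedged illustration of adapting checkpoint timing to fault history, the sketch below uses Young's classic first-order approximation, in which the checkpoint interval is the square root of twice the checkpoint cost multiplied by the mean time between failures (MTBF). This is a swapped-in textbook technique, not the prediction model of the application, and the function names are hypothetical.

```python
import math


def mtbf_from_history(window_s: float, fault_count: int) -> float:
    """Estimate mean time between failures from a fault count.

    Assumption: with no recorded faults, the whole window is used as a
    conservative lower bound on the MTBF.
    """
    if window_s <= 0:
        raise ValueError("observation window must be positive")
    if fault_count <= 0:
        return window_s
    return window_s / fault_count


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: interval = sqrt(2 * cost * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Usage: the same 40000-second window with few faults and with many faults.
calm = young_interval(2.0, mtbf_from_history(40000.0, 4))
busy = young_interval(2.0, mtbf_from_history(40000.0, 16))
```

When the fault count quadruples, the suggested interval halves, which matches the adaptive behaviour described above: checkpoints become more frequent as the system becomes less stable.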
In summary, through intelligence, adaptivity, accuracy, and data-driven decision making, the checkpoint of the present application may significantly improve on existing checkpoints, thereby more effectively improving the stability, reliability, and performance of the system.
For example, assume that the detection point prediction model comprises a policy prediction layer, a detection probe layer, and a detection creation layer. Based on this assumed structure, in step 102, performing predictive analysis on the historical system monitoring data through the detection point prediction model and creating a target detection point matched with the intelligent computing cloud operating system based on the analysis result may be implemented as the following steps:
201, performing predictive analysis on the historical system monitoring data through the policy prediction layer to obtain a detection point creation policy; the detection point creation policy at least comprises: a creation condition, a creation opportunity, and a storage space position;
202, judging, through the detection probe layer, whether the creation condition and/or the creation opportunity set in the detection point creation policy has been reached;
203, after the creation condition and/or the creation opportunity set in the detection point creation policy has been reached, constructing, through the detection creation layer, a corresponding target detection point based on the real-time monitoring data, and storing the target detection point in the storage space position set in the detection point creation policy.
Steps 201 through 203 work together as follows. Through the policy prediction layer (step 201), the system performs predictive analysis on historical monitoring data to determine an optimal detection point creation policy, including creation conditions, creation opportunities, and storage locations. This avoids creating unnecessary checkpoints and thereby optimizes the utilization of system resources. Through the detection probe layer (step 202), the system monitors the system state in real time and judges whether the conditions and timing for creating a checkpoint are satisfied. Once they are satisfied, the system can immediately build a target checkpoint at the detection creation layer (step 203), which reduces system downtime and improves availability. Because checkpoint creation is based on both predictive analysis and real-time monitoring data, changes in system state and potential failure modes can be captured more accurately. By regularly creating target checkpoints, the system can better cope with faults and recover quickly, improving reliability. Because the creation policy is derived from historical monitoring data and predictive analysis, checkpoint storage can also be managed more efficiently: the system may create checkpoints only at key times while controlling their number and storage locations, thereby saving storage space.
Illustratively, assume that in step 201, predictive analysis indicates that the system is less likely to fail during the night low load period of each day and is more likely to fail during the high load period. Thus, the creation strategy may be set to create checkpoints during night low load periods to reduce impact on system performance and improve reliability of the system. In step 202, the detection probe layer may monitor the load condition of the system in real time and determine whether it is currently in a night low load period. If so, the system will proceed to step 203 and create a target checkpoint during the night low load period. Finally, in step 203, the system builds a target checkpoint based on the real-time monitoring data and stores it in a pre-specified storage location for future use when the system needs to be restored or rolled back.
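The three layers in steps 201 to 203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `CreationPolicy`, `predict_policy`, `probe`, and `create_checkpoint`, the hour-of-day averaging that stands in for the prediction model, the 1.5 load margin, and the storage path are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class CreationPolicy:
    """Hypothetical output of step 201 (policy prediction layer)."""
    max_load: float        # creation condition: load must not exceed this
    window_start: time     # creation opportunity: start of allowed window
    window_end: time       # creation opportunity: end of allowed window
    storage_location: str  # storage space position for the checkpoint


def predict_policy(history: list[tuple[int, float]]) -> CreationPolicy:
    """Step 201 stand-in: choose the hour with the lowest mean load.

    `history` holds (hour_of_day, load) samples. A trained prediction
    model would replace this simple averaging.
    """
    by_hour: dict[int, list[float]] = {}
    for hour, load in history:
        by_hour.setdefault(hour, []).append(load)
    means = {h: sum(v) / len(v) for h, v in by_hour.items()}
    quiet_hour = min(means, key=means.get)
    return CreationPolicy(
        max_load=means[quiet_hour] * 1.5,        # assumed safety margin
        window_start=time(quiet_hour, 0),
        window_end=time(quiet_hour, 59, 59),
        storage_location="/backup/checkpoints",  # hypothetical path
    )


def probe(policy: CreationPolicy, now: datetime, load: float) -> bool:
    """Step 202 stand-in: are the condition and the opportunity both met?"""
    in_window = policy.window_start <= now.time() <= policy.window_end
    return in_window and load <= policy.max_load


def create_checkpoint(policy: CreationPolicy, now: datetime, snapshot: dict) -> dict:
    """Step 203 stand-in: build a checkpoint record from real-time data."""
    return {
        "created_at": now.isoformat(),
        "location": policy.storage_location,
        "state": dict(snapshot),
    }


history = [(2, 0.10), (2, 0.14), (14, 0.85), (14, 0.90), (20, 0.55)]
policy = predict_policy(history)
night = datetime(2024, 4, 11, 2, 30)
noon = datetime(2024, 4, 11, 14, 30)
if probe(policy, night, load=0.15):
    checkpoint = create_checkpoint(policy, night, {"cpu": 0.15})
```

With the sample history above, the quietest hour is 02:00, so the probe accepts the night-time sample and would reject the same load at noon.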
In embodiments of the present application, creation conditions and/or creation opportunities generally refer to the criteria that determine when the system begins to perform a particular operation or task. In the above scenario, the creation conditions and/or creation opportunities may take the following factors into account:
1. System operation status: the conditions under which the detection points are created may depend on the current operating state of the system. For example, the detection point creation operation may be performed while the system is at low load to minimize the impact on system performance.
2. Data backup requirements: the timing of creating the detection points is typically related to the data backup requirements. If the data in the system needs to be backed up regularly or after a specific event occurs, the creation timing of the detection point can be determined according to the backup plan.
3. System stability: the stability of the system should be taken into account when creating the detection points. Creation is preferably performed while the system is in a steady state, to ensure that it does not cause a system crash or data loss.
4. Resource availability: creating detection points may consume system resources such as storage space and computing resources. Therefore, the system should have sufficient resources available before the operation is performed.
5. Business requirements: the timing of creating the detection points may also depend on business requirements. For example, some business processes may require that detection points be created at specific points in time to ensure consistency and integrity of the data.
6. Event triggering: in some cases, the creation of a detection point may be triggered by a specific event, such as a system update or the completion of an important data operation. In this case, the creation opportunity is directly tied to the time at which the trigger event occurs.
In other words, the creation condition and/or creation opportunity is determined by factors such as data backup requirements and event triggers, subject to the system being stable, resources being sufficient, and business requirements being met.
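The six factors above can be read as gating factors (stability, load, resources, business constraints) combined with triggering factors (backup schedule, events). The following Python sketch illustrates that reading only; the field names, the default thresholds, and the event names in `TRIGGER_EVENTS` are hypothetical and are not specified in the present application.

```python
from dataclasses import dataclass, field


@dataclass
class SystemSnapshot:
    load: float                  # current load in [0, 1]
    stable: bool                 # system reports a steady state
    free_storage_gb: float       # storage available for a checkpoint
    backup_due: bool             # scheduled backup time has been reached
    business_hold: bool = False  # a business process forbids creation now
    events: set[str] = field(default_factory=set)


# Hypothetical event names that directly trigger checkpoint creation.
TRIGGER_EVENTS = {"system_update", "bulk_write_complete"}


def should_create(s: SystemSnapshot, max_load: float = 0.3,
                  min_free_gb: float = 10.0) -> bool:
    # Gating factors (items 1, 3, 4, 5): all must hold.
    gates_ok = (s.stable and s.load <= max_load
                and s.free_storage_gb >= min_free_gb
                and not s.business_hold)
    # Triggering factors (items 2, 6): at least one must request creation.
    triggered = s.backup_due or bool(s.events & TRIGGER_EVENTS)
    return gates_ok and triggered


scheduled = SystemSnapshot(0.2, True, 50.0, backup_due=True)
event_driven = SystemSnapshot(0.2, True, 50.0, backup_due=False,
                              events={"system_update"})
overloaded = SystemSnapshot(0.8, True, 50.0, backup_due=True)
```

Separating gates from triggers keeps an event or a backup schedule from forcing a checkpoint while the system is unstable or short of resources.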
In the above steps, constructing the target detection point based on real-time monitoring data and storing it in the storage space position specified by the detection point creation policy is a key step, since it ensures that the system can accurately restore to the previous state when needed. Specifically, the system first collects real-time monitoring data, which may include system status, performance metrics, storage space usage, and so forth. Such data may be obtained from system monitoring tools, sensors, log files, and the like. Based on the collected data, the system then determines the point in time at which to construct the target detection point, typically when the system is in a steady state with good performance and high data consistency, and builds the target detection point from the state captured by the current monitoring data so that this state can be accurately restored in the future. The detection point must be stored in a safe and reliable location so that the system state can be restored quickly and reliably when required; based on the storage space position set in the detection point creation policy, the system selects an appropriate storage medium or storage system for the target detection point data. The system then converts the real-time monitoring data into the target detection point according to the chosen point in time and storage position, and stores it in the designated storage space position. This may involve packing, compressing, and encrypting the data to ensure its integrity and security. Once created, the target detection point needs to be managed and maintained, including periodic checking, updating, and backup. A system administrator needs to ensure the integrity and availability of the detection point data so that the system state can be quickly and efficiently restored when needed.
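As a rough illustration of the packing, compression, and integrity steps described above, the following Python sketch serializes a state dictionary, compresses it, and stores a SHA-256 digest alongside it for verification on restore. The function names, file suffixes, and sample state are hypothetical. Encryption is omitted because the Python standard library provides no symmetric cipher; a real deployment would add it with a vetted cryptography library.

```python
import gzip
import hashlib
import json
import tempfile
from pathlib import Path


def write_checkpoint(state: dict, directory: Path, name: str) -> Path:
    """Pack and compress `state`, then store it with a SHA-256 digest."""
    payload = gzip.compress(json.dumps(state, sort_keys=True).encode("utf-8"))
    path = directory / f"{name}.ckpt.gz"
    path.write_bytes(payload)
    # The digest lets a later restore detect corruption of the stored data.
    (directory / f"{name}.sha256").write_text(hashlib.sha256(payload).hexdigest())
    return path


def read_checkpoint(path: Path) -> dict:
    """Verify the stored digest, then decompress and unpack the state."""
    payload = path.read_bytes()
    digest_path = path.with_name(path.name.replace(".ckpt.gz", ".sha256"))
    if hashlib.sha256(payload).hexdigest() != digest_path.read_text():
        raise ValueError("checkpoint failed integrity check")
    return json.loads(gzip.decompress(payload).decode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    state = {"cpu": 0.12, "queue_length": 3, "services": ["db", "api"]}
    saved = write_checkpoint(state, Path(tmp), "cp-0001")
    restored = read_checkpoint(saved)
```

The temporary directory stands in for the storage space position named in the creation policy.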
Further optionally, in step 201, a first feature extraction module extracts, from the historical system monitoring data, the data storage behavior features corresponding to each local network area in the intelligent computing cloud operating system; the data storage behavior features at least comprise: data throughput features, access frequency features, and data redundancy requirement features corresponding to each local network area. A second feature extraction module extracts, from the historical system monitoring data, the data security features corresponding to each local network area; the data security features at least comprise: data encryption features and access control features corresponding to each local network area. A backup demand prediction module then performs comprehensive prediction on the data storage behavior features and the data security features of each local network area to obtain the data backup requirement information corresponding to each local network area; the data backup requirement information at least comprises: storage requirements, security requirements, task processing objectives, and data storage media types. Finally, a data backup policy corresponding to each local network area is generated based on the data backup requirement information, and these policies are combined to construct the detection point creation policy.
In this case, the additional feature extraction modules and the backup demand prediction module allow the data backup demand of the system to be analyzed and predicted in more detail, so as to optimize the checkpoint creation policy. Specific examples are as follows:
It is assumed that the first feature extraction module extracts data storage behavior features, such as data throughput, access frequency, and data redundancy requirements, for each local network region based on the historical monitoring data. For example, within a certain local network area, data throughput and access frequency may increase significantly over a particular period of the workday, while data redundancy requirements may increase as traffic increases.
It is assumed that the second feature extraction module extracts data security features, such as data encryption and access control, of each local network region based on the historical monitoring data. For example, a local network area may need to encrypt sensitive data and set strict access control policies.
Through the backup demand prediction module, the system can jointly consider the data storage behavior features and the data security features to predict the data backup requirement information of each local network area. For example, if a local network area has high data throughput and access frequency together with higher data security requirements, it may be predicted that the data backup demand of that area is greater and that checkpoints need to be created more frequently. Based on the data backup requirement information, the system may generate a data backup policy for each local network area, covering storage requirements, security requirements, task processing objectives, and data storage media types. For example, for areas with higher data throughput and access frequency, a faster and more reliable storage medium may be selected and a more frequent backup policy set. Finally, the system can combine the data backup policies of the local network areas into the checkpoint creation policy, which guides the system to create checkpoints in different areas according to their specific data backup requirements.
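The per-area flow from features to backup policy can be sketched as follows. The present application does not specify the form of the backup demand prediction module, so this Python example substitutes a simple hand-weighted score for it; the normalization constants, the 0.6/0.4 weights, the interval thresholds, and the area names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionFeatures:
    # Data storage behavior features (first feature extraction module).
    throughput_mbps: float
    accesses_per_min: float
    redundancy_copies: int
    # Data security features (second feature extraction module).
    encrypted: bool
    strict_access_control: bool


def backup_requirement(f: RegionFeatures) -> dict:
    """Rule-based stand-in for the backup demand prediction module."""
    storage_score = (min(f.throughput_mbps / 1000, 1.0)
                     + min(f.accesses_per_min / 5000, 1.0)
                     + min(f.redundancy_copies / 3, 1.0)) / 3
    security_score = (int(f.encrypted) + int(f.strict_access_control)) / 2
    demand = 0.6 * storage_score + 0.4 * security_score  # assumed weights
    return {
        "demand": round(demand, 3),
        # Higher demand leads to more frequent checkpoints.
        "interval_minutes": 15 if demand >= 0.7 else 60 if demand >= 0.4 else 240,
        # Busy areas get the faster storage medium.
        "medium": "ssd" if storage_score >= 0.5 else "hdd",
        "encrypt_backup": f.encrypted,
    }


regions = {
    "area-a": RegionFeatures(900, 4500, 3, True, True),
    "area-b": RegionFeatures(50, 100, 1, False, False),
}
# The per-area backup policies together form the checkpoint creation policy.
creation_policy = {name: backup_requirement(f) for name, f in regions.items()}
```

Here the busy, security-sensitive area is assigned a short checkpoint interval on fast storage, while the quiet area is checkpointed rarely on slower storage.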
With this comprehensive data backup demand prediction and policy generation method, the system can predict data backup demand more accurately from the specific data storage behavior and security requirements of each local network area and avoid unnecessary backup operations, thereby saving resources and improving efficiency. Based on factors such as storage requirements, security requirements, task processing objectives, and storage medium types, the system can generate a suitable data backup policy that supports the integrity, reliability, and efficiency of data backup. Because a backup policy is customized for the specific requirements of each local network area, management and optimization can be targeted, improving the overall performance and reliability of the system. Accurate, customized backup also lets the system use storage resources more effectively, avoid wasted resources and unnecessary cost, better protect the security and integrity of data, and reduce the likelihood of data loss.
As an optional embodiment, after the above step of generating a data backup policy corresponding to each local network area based on the data backup requirement information and combining these policies to construct the detection point creation policy, the following steps may be further executed to implement reliability assessment:
Acquiring historical reliability data of storage space positions in the detection point creation strategy through a reliability evaluation module; the historical reliability data includes at least: the historical failure rate, the historical data loss rate, the historical recovery time and the available recovery mechanism of the storage space position; based on the historical reliability data, performing reliability evaluation processing on the storage space position in the detection point creation strategy to obtain a corresponding reliability matrix; based on a preset dynamic risk affordable matrix, performing reliability screening on elements in the reliability matrix to obtain corresponding target elements; the target element accords with a preset reliability evaluation qualification condition; and outputting the target element as a storage space position qualified in reliability evaluation.
Executing the reliability evaluation steps after the above steps can further improve the stability and reliability of the system. By acquiring the historical reliability data of each storage space position, including the historical failure rate, data loss rate, recovery time, and the like, the system can comprehensively evaluate the historical performance of the storage space. This shows how the storage space has performed in past operation and provides an important reference for evaluating its current reliability. Based on the historical reliability data, the system can generate a corresponding reliability matrix that clearly presents the reliability information of the different storage space positions; such a matrix helps a decision maker understand the reliability of the storage space and provides a basis for subsequent evaluation. By screening the elements in the reliability matrix against the preset dynamic risk affordable matrix, the system can determine which storage space positions are sufficiently reliable and meet the preset reliability evaluation qualification condition, which ensures that the selected storage space meets the reliability requirements of the system. Finally, the system outputs the storage space positions that pass the reliability evaluation. Having been rigorously evaluated and screened, these positions have high reliability and can serve as ideal storage locations for created checkpoints, which increases confidence in the backed-up data and helps ensure its security and reliability.
For example, assume that in a historical reliability data analysis, the failure rate of a storage space location over the past year is found to be very low, the data loss rate is also low, and the recovery time at the time of failure is short. Based on the data, the system generates a corresponding reliability matrix, performs reliability screening, and determines that the storage space position meets the preset reliability evaluation qualification condition. Thus, the storage space location is output as a location that is eligible for reliability assessment, and can be an ideal storage location for creating a checkpoint. Therefore, the confidence of the system in the data backup process can be improved, and the safety and reliability of the backup data are ensured.
In an actual application, further optionally, the reliability prediction score of each element in the reliability matrix represents the reliability evaluation score, under a reliability evaluation dimension, of the storage space position in the corresponding detection point creation policy.
The reliability prediction score of element i in the reliability matrix is calculated from the following quantities: the historical failure rate of the i-th storage space position; the historical data loss rate of the i-th storage space position; the historical recovery time of the i-th storage space position; the available recovery mechanism data value of the i-th storage space position; the weight coefficients corresponding to each of these four dimensions; the reliability assessment score of the i-th storage space position under the k-th external system factor dimension; and the weight coefficient corresponding to the k-th external system factor. The external system factors at least comprise: traffic, data volume, geographic location where the storage device is located, network connection quality, and network provider reputation.
Based on the above formula, further, the preset reliability evaluation qualification condition that the target element must meet is assumed to be: the reliability prediction score of element i in the reliability matrix is above the corresponding reference value in the dynamic risk affordable matrix. The corresponding reference value in the dynamic risk affordable matrix is dynamically adjusted along with the change trend of the actual storage scheme of the intelligent computing cloud operating system.
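The scoring and screening described above can be illustrated with the following Python sketch. The combining rule used here is an assumption, not the formula of the present application: it takes a weighted sum of the four historical dimensions, each normalized so that a higher value means higher reliability, plus a weighted sum of the external system factor scores. The function name, the 24-hour normalization bound, the sample locations, the weights, and the reference values are all hypothetical.

```python
def reliability_score(failure_rate, loss_rate, recovery_hours, mechanism_value,
                      external_scores, weights, external_weights,
                      max_recovery_hours=24.0):
    """Assumed weighted-sum form of the reliability prediction score."""
    a, b, c, d = weights
    internal = (a * (1.0 - failure_rate)   # fewer failures is better
                + b * (1.0 - loss_rate)    # less data loss is better
                + c * (1.0 - min(recovery_hours / max_recovery_hours, 1.0))
                + d * mechanism_value)     # richer recovery mechanisms
    external = sum(w * s for w, s in zip(external_weights, external_scores))
    return internal + external


# Hypothetical storage space positions:
# (failure rate, loss rate, recovery hours, mechanism value, external scores)
locations = {
    "ssd-pool": (0.01, 0.001, 0.5, 0.9, [0.8, 0.9]),
    "tape-archive": (0.10, 0.05, 12.0, 0.4, [0.5, 0.6]),
}
weights = (0.3, 0.3, 0.2, 0.2)
external_weights = (0.1, 0.1)

scores = {
    name: reliability_score(f, l, t, m, ext, weights, external_weights)
    for name, (f, l, t, m, ext) in locations.items()
}

# Reference values from a hypothetical dynamic risk affordable matrix.
reference = {"ssd-pool": 1.0, "tape-archive": 0.9}

# Qualification condition: the score must be above the reference value.
qualified = [name for name, s in scores.items() if s > reference[name]]
```

Under these sample numbers only `ssd-pool` exceeds its reference value and is output as a qualified storage space position.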
Because the reference values in the dynamic risk affordable matrix are dynamically adjusted along with the change trend of the actual storage scheme of the intelligent computing cloud operating system, the system can adapt in real time: the reliability evaluation standard is adjusted according to the current system configuration and running state, so that the evaluation result matches the actual condition of the current environment. Adjusting the reference values according to the trend of the actual storage scheme also lets the system predict the reliability of the storage space more accurately, which helps to discover potential reliability problems in advance and to take preventive maintenance measures so that the system continues to provide stable service. Since the current system state and configuration are taken into account when evaluating storage space reliability, excessive or insufficient resource allocation can be avoided, which helps to optimize resource utilization and improve system performance and efficiency. Timely adjustment of the evaluation standard also helps the system cope with risks brought by configuration and environmental changes, reducing the risk of data loss and system failure caused by storage space reliability problems.
For example, suppose that the intelligent computing cloud operating system has updated its storage architecture over a period of time and introduced new storage devices and technologies. In this case, the reference values in the dynamic risk affordable matrix are adjusted according to the characteristics and performance of the new storage scheme. If the new storage devices have higher reliability and performance, the system correspondingly raises the reliability evaluation standard, so that a storage space position must achieve a higher reliability prediction score to qualify. In this way, the reliability standard of the system remains adapted to the current storage environment, improving the stability and reliability of the system.
In an optional embodiment, after the above step of performing reliability screening on the elements in the reliability matrix based on the preset dynamic risk affordable matrix to obtain the corresponding target elements, the storage space positions corresponding to the target elements may be further ranked from high to low according to their reliability prediction scores. Specifically, the storage space positions ranked within a first preset number of places are used as the first-priority target storage space positions in the detection point creation policy; the storage space positions ranked within a second preset number of places are used as the second-priority backup storage space positions in the detection point creation policy; and the storage space positions ranked within a third preset number of places are used as the third-priority extended storage space positions in the detection point creation policy.
The probabilities of the storage space positions corresponding to the first priority, the second priority, and the third priority are arranged from high to low. The preset number of places for each priority level is set flexibly according to the data volume of the intelligent computing cloud operating system. Of course, these ranking levels are merely examples and are not intended to be limiting.
In the above embodiment, the elements in the reliability matrix are screened against the preset dynamic risk affordable matrix to obtain the target elements, and the corresponding storage space positions are ranked according to their reliability prediction scores. These storage space positions are then assigned to different priorities in the detection point creation policy according to the preset numbers of places, so that the system can recover quickly and reliably to the previous state in the event of a failure.
In a specific implementation, the system is preset with a dynamic risk affordable matrix for evaluating the reliability of storage space positions. Storage positions may be evaluated based on a variety of factors, such as storage device type, geographic location, and network connection, to generate reliability prediction scores. The elements in the reliability matrix are screened against the dynamic risk affordable matrix to obtain the target elements, namely the storage space positions with high reliability. The storage space positions corresponding to the target elements are ranked from high to low by reliability prediction score to determine their priorities, and the ranked positions are assigned to the different priorities in the detection point creation policy according to the preset numbers of places. For example, the most reliable storage positions are assigned to the first priority, and so on. The preset number of places for each priority can be set flexibly according to the data volume and requirements of the system, so that storage resources can be managed and used efficiently at different scales and loads.
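The ranking and tiering just described amount to sorting the qualified storage positions by score and slicing the ranked list. A minimal Python sketch follows, assuming hypothetical location names, scores, and tier sizes; the function name `assign_priorities` and the tier labels are illustrative only.

```python
def assign_priorities(scores: dict[str, float],
                      tier_sizes: tuple[int, int, int]) -> dict[str, list[str]]:
    """Rank storage positions by score and slice them into three tiers."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    first, second, third = tier_sizes
    return {
        "target": ranked[:first],                      # first priority
        "backup": ranked[first:first + second],        # second priority
        "extended": ranked[first + second:first + second + third],  # third
    }


# Hypothetical reliability prediction scores of qualified positions.
qualified_scores = {"loc-a": 0.97, "loc-b": 0.91, "loc-c": 0.88,
                    "loc-d": 0.80, "loc-e": 0.75}
tiers = assign_priorities(qualified_scores, tier_sizes=(1, 2, 2))
```

Changing `tier_sizes` corresponds to setting the preset number of places per priority according to the data volume of the system.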
With such an embodiment, the system can intelligently select and allocate storage positions based on their reliability prediction scores, so that when the system fails, recovery can proceed from the most reliable storage position in order of priority. By preferentially selecting storage space positions with high reliability, the system can recover to a previous state more quickly and reliably when a failure occurs, reducing service interruption time and the risk of data loss. Allocating according to the reliability of each storage position makes the greatest use of reliable storage resources while preventing important data from being stored in an unreliable position, thereby improving overall storage efficiency. Through the dynamic risk affordable matrix and the flexibly set number of places per priority, the system can adjust the distribution of storage positions according to actual conditions, adapt to different data scales and load demands, and maintain stability and reliability.
In the embodiment of the present application, further optionally, the configuration information of the target detection point after the creation is dynamically adjusted along with the change of the real-time task running state in the intelligent computing cloud operating system.
Specifically, the system first continuously monitors the real-time task running state in the intelligent computing cloud operating system, including indexes such as task queue length, task execution time, and task success rate. These indexes reflect the current workload and task execution efficiency of the system. The system then dynamically adjusts the number and positions of the target detection points according to changes in the real-time task running state. For example, if the task queue length suddenly increases, the system may increase the number of detection points to monitor task execution efficiency more accurately. If certain tasks fail frequently, the system may move detection points onto the critical modules or computing resources associated with those tasks to strengthen its ability to repair abnormal situations. The adjustment strategy itself needs to be optimized according to the actual situation: a set of adaptive adjustment strategies may be designed based on the characteristics and operating modes of the system, such as threshold-based adjustment or trend analysis based on historical data. System resources and performance overhead must also be considered so that the adjustment process does not place an excessive burden on the system. The system monitors the performance and effect of the target detection points in real time and adjusts them based on this feedback. If some detection points are found to be unsuitable, or there are too many or too few of them, the system should adjust them promptly to maintain the accuracy and effectiveness of detection.
In order to realize automatic adjustment, the system can design a corresponding algorithm and a dynamic adjustment mechanism, and automatically trigger the adjustment of the number and the positions of the target detection points while monitoring the state change of the system. Therefore, manual intervention can be reduced, and the automation degree and response speed of the system are improved.
For example, to implement the function of automatically adjusting the number and position of target detection points, the system needs to monitor various key indicators of the intelligent computing cloud operating system, such as resource utilization, task queue length, task execution time, etc., in real time, and collect these data for subsequent analysis and decision. When a change in system state is detected, the system needs to perform anomaly detection and analysis to determine whether the number and position of target detection points need to be adjusted.
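One of the simplest adjustment strategies mentioned above is threshold-based adjustment. The Python sketch below shows one possible form under stated assumptions: the function name, the threshold values, and the bounds on the number of detection points are hypothetical, and the step size of one keeps each adjustment small so that the adjustment itself does not burden the system.

```python
def adjust_checkpoint_count(current: int, queue_length: int,
                            success_rate: float, *,
                            queue_high: int = 100, queue_low: int = 20,
                            min_success: float = 0.95,
                            lower: int = 1, upper: int = 16) -> int:
    """Threshold-based adjustment of the number of target detection points."""
    if queue_length > queue_high or success_rate < min_success:
        # Heavy load or frequent failures: add one detection point.
        target = current + 1
    elif queue_length < queue_low and success_rate >= min_success:
        # Light load and healthy tasks: remove one to save resources.
        target = current - 1
    else:
        target = current
    # Keep the count within bounds to limit monitoring overhead.
    return max(lower, min(upper, target))


busy = adjust_checkpoint_count(4, queue_length=150, success_rate=0.99)
idle = adjust_checkpoint_count(4, queue_length=10, success_rate=0.99)
steady = adjust_checkpoint_count(4, queue_length=50, success_rate=0.99)
```

The gap between `queue_low` and `queue_high` acts as a dead band, so small fluctuations in queue length do not cause the count to oscillate.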
Illustratively, a dynamic detection point configuration model is constructed based on key indexes in the intelligent computing cloud operating system such as resource utilization, task queue length, and task execution time; the model comprises a generator and a discriminator. The generator is responsible for generating data samples, and the discriminator is responsible for judging whether a sample is real or generated. The dynamic detection point configuration model is trained using the training data. During training, the generator attempts to generate samples that resemble the real data, while the discriminator attempts to distinguish the generated samples from the real data. Through adversarial training, the ability of the generator to produce realistic samples and the ability of the discriminator to correctly distinguish real samples from generated samples both gradually improve. During training, the model also learns latent representations of the data. These latent representations capture the distribution characteristics of the data well, which is very useful for anomaly detection. The trained model is then used for anomaly detection: for a new data sample, its corresponding latent representation is first generated by the generator, and the discriminator then judges its authenticity. If the discriminator deems the sample unrealistic, the sample may be an anomaly. The anomaly detection results are then evaluated and optimized. The performance of the model can be evaluated using cross-validation, ROC curves, accuracy, recall, and the like, and the model can be adjusted and optimized as required.
And finally, deploying the trained model into an intelligent computing cloud operating system, monitoring the running state of a real-time task in the intelligent computing cloud operating system in real time, and detecting the abnormality. Through the steps, the dynamic detection point configuration model can be utilized for abnormality detection, and the stability and the safety of the system are improved in practical application.
The dynamic detection point configuration model herein may offer the following improvements in dynamically adjusting the target detection point configuration information. First, real-time data modeling capability: the model combines a generative adversarial network with variational inference and can therefore better model real-time changes in the data, so that when the real-time task running state changes, the model can promptly update its understanding of the data distribution and adjust the anomaly detection strategy accordingly. Second, adaptation to complex environments: by introducing a discriminator, the model optimizes the variational inference process and improves its ability to model the data distribution, which lets it better adapt to data changes in complex environments, such as system load fluctuations and external environmental changes, and perform anomaly detection more accurately. Third, flexibility and adjustability: the adversarial training mechanism between the generator and the discriminator gives the model a degree of flexibility and adjustability; when the real-time task running state changes, the training parameters or network structure of the model can be adjusted to fit the new data distribution, thereby dynamically adjusting the target detection point configuration information. Fourth, real-time updating of latent representations: the latent representations learned by the model capture the distribution characteristics of the data, and when the real-time task running state changes, the model can update them in real time, maintaining accurate modeling of the data distribution and providing a more reliable basis for dynamically adjusting the target detection point configuration information.
Furthermore, based on the anomaly detection and analysis results for the real-time task running state, the system makes an adjustment decision that determines how the number and positions of the target detection points should be adjusted. This process can follow a preset adjustment strategy, or an adjustment scheme can be generated dynamically from the data and system state monitored in real time. Once the adjustment scheme is determined, the system automatically performs the corresponding operations, including adding, deleting, or moving target detection points. This can be achieved by an automated execution mechanism built into the system, for example a script or program that adjusts the detection points automatically.
After the adjustment is completed, the system monitors the adjusted target detection points and provides feedback on the actual effect. If the adjusted detection points are still found to be unsuitable, the system may trigger the adjustment process again until the desired effect is reached. The system also needs to continuously optimize and improve the automatic adjustment mechanism, including improving the abnormality detection algorithm, optimizing the adjustment strategy and improving the efficiency of automatic execution, so as to raise the degree of automation and the response speed of the system. With these algorithms and mechanisms, the system can automatically trigger adjustment of the number and positions of the target detection points when a change in system state is detected, which reduces manual intervention, improves the degree of automation and the response speed, and further enhances the intelligent checkpoint recovery effect.
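The adjust, monitor and re-trigger loop described above can be sketched as follows. This is a minimal illustration only: the names `DetectionPointConfig`, `apply_adjustment` and `adjust_until_suitable`, the tuple encoding of operations, and the `max_rounds` limit are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionPointConfig:
    """Hypothetical container for the current set of detection point positions."""
    positions: set = field(default_factory=set)


def apply_adjustment(config, plan):
    """Apply a plan of ("add", p), ("delete", p) or ("move", src, dst) operations in order."""
    for op in plan:
        if op[0] == "add":
            config.positions.add(op[1])
        elif op[0] == "delete":
            config.positions.discard(op[1])
        elif op[0] == "move":
            _, src, dst = op
            if src in config.positions:
                config.positions.remove(src)
                config.positions.add(dst)
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return config


def adjust_until_suitable(config, make_plan, is_suitable, max_rounds=5):
    """Re-trigger adjustment while the feedback check fails; return the rounds used."""
    rounds = 0
    while not is_suitable(config) and rounds < max_rounds:
        apply_adjustment(config, make_plan(config))
        rounds += 1
    return rounds
```

Here `make_plan` stands in for either a preset adjustment strategy or a dynamically generated scheme, and `is_suitable` stands in for the feedback obtained by monitoring the adjusted detection points. The round limit is an added safeguard against endless re-triggering.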
103, Monitoring the current system state of the intelligent computing cloud operating system in real time through a real-time monitoring model, triggering a corresponding target detection point based on a real-time monitoring result, and recovering the intelligent computing cloud operating system.
In an embodiment of the present application, a storage space location refers to a location for storing system checkpoint data. A system checkpoint is a snapshot of data that is periodically created while the system is operating properly for recovery operations when the system fails or is abnormal. By way of example, the storage space locations may be the following:
Local storage: system checkpoint data is stored on a local hard disk or solid-state drive. This approach offers fast access, but its capacity is limited and it carries the risk of a single point of failure.
Network storage: system checkpoint data is stored in a network using a network storage device, such as a Network Attached Storage (NAS) or a Storage Area Network (SAN). This approach may provide greater storage space and higher reliability, but may increase data access latency.
Cloud storage: system checkpoint data is stored in the cloud using a storage service offered by a cloud service provider. This approach is flexible and highly scalable, but network latency and the stability of the cloud service must be considered.
It can be appreciated that selecting the storage space location needs to take into account the requirements of the system for storage capacity, access speed, reliability and cost, and perform reasonable configuration and deployment according to practical situations. For example, a multi-level storage architecture is adopted to store critical system check point data in a storage device with higher reliability and larger capacity so as to ensure the reliability and efficiency of the system recovery operation.
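As a rough illustration of weighing capacity, access speed, reliability and cost when choosing a storage space location, consider the sketch below. The `StorageLocation` fields, the `choose_location` rule (most reliable feasible location for critical data, cheapest feasible location otherwise) and any numbers used with it are assumptions for the example, not values from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLocation:
    """Hypothetical description of one candidate storage space location."""
    name: str
    capacity_gb: float
    latency_ms: float
    reliability: float   # assumed to be a score between 0 and 1
    cost_per_gb: float


def choose_location(candidates, size_gb, max_latency_ms, critical):
    """Pick a location that fits the checkpoint and meets the latency limit.

    Critical checkpoint data goes to the most reliable feasible location
    (ties broken by larger capacity); other data goes to the cheapest one.
    Returns None when no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if c.capacity_gb >= size_gb and c.latency_ms <= max_latency_ms
    ]
    if not feasible:
        return None
    if critical:
        return max(feasible, key=lambda c: (c.reliability, c.capacity_gb))
    return min(feasible, key=lambda c: c.cost_per_gb)
```

A multi-level architecture could call this function once per checkpoint, passing `critical=True` only for the checkpoints that the recovery operation depends on most.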
As an alternative embodiment, it is assumed that the real-time monitoring model comprises at least: a dynamic detection layer, a pre-judging layer, a triggering layer and a recovery layer. Based on this, step 103, in which the current system state of the intelligent computing cloud operating system is monitored in real time through the real-time monitoring model and a corresponding target detection point is triggered based on the real-time monitoring result so as to perform a recovery operation on the intelligent computing cloud operating system, may be implemented as the following steps:
301, monitoring the current system state of the intelligent computing cloud operating system in real time through a dynamic detection layer to obtain a dynamic monitoring state; the dynamic monitoring state at least comprises: resource utilization rate, task completion condition, data transmission condition and system operation load;
302, pre-judging the dynamic monitoring state through the pre-judging layer to select a target dynamic monitoring state in which an abnormality is about to occur;
303, selecting an optimal detection point matched with the target dynamic monitoring state from the target detection points through a triggering layer; the data consistency between the recoverable data stored in the optimal detection point and the historical real-time monitoring state at the target time before the occurrence of the abnormality meets the preset recoverable condition;
304, performing recovery operation on the intelligent computing cloud operating system based on the optimal detection point through a recovery layer, so as to ensure that the recovered intelligent computing cloud operating system is in a normal running state and keep data integrity.
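Steps 301 to 304 chain four layers in sequence, which can be sketched as a small pipeline. The `MonitoringState` fields mirror the four items named in step 301, but their types and units, the function `run_recovery_cycle` and the convention of returning `None` when nothing needs to be done are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class MonitoringState:
    """Hypothetical dynamic monitoring state covering the items named in step 301."""
    resource_utilization: float   # e.g. fraction of CPU in use
    task_backlog: int             # proxy for the task completion condition
    transfer_rate_mbps: float     # proxy for the data transmission condition
    system_load: float


def run_recovery_cycle(detect, prejudge, select_point, recover):
    """Run one pass through the four layers; each argument is a callable layer."""
    state = detect()                  # step 301: dynamic detection layer
    target = prejudge(state)          # step 302: pre-judging layer
    if target is None:
        return None                   # no impending abnormality predicted
    point = select_point(target)      # step 303: triggering layer
    if point is None:
        return None                   # no detection point meets the recoverable condition
    return recover(point)             # step 304: recovery layer
```

Passing the layers as callables keeps the sketch independent of how each layer is actually implemented, for example as a rule set or as a trained model.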
This embodiment describes a recovery system based on a real-time monitoring model that can predict a system abnormality before it occurs and recover in time, so as to ensure the normal operation and data integrity of the intelligent computing cloud operating system. In the dynamic detection layer (step 301), the current state of the intelligent computing cloud operating system is monitored in real time, including dynamic monitoring states such as resource utilization rate, task completion condition, data transmission condition and system operation load. In the pre-judging layer (step 302), the target dynamic monitoring state in which an abnormality is about to occur is predicted from the dynamic monitoring state, which helps to discover problems in advance and respond to them. In the triggering layer (step 303), the optimal detection point matching the predicted target dynamic monitoring state is selected from the target detection points, and the data stored in that detection point satisfies the preset recoverable condition, which ensures data consistency. In the recovery layer (step 304), the recovery operation is performed based on the optimal detection point, so that the intelligent computing cloud operating system is in a normal running state after recovery and data integrity is maintained.
Therefore, through the pre-judgment of the pre-judging layer, the system can predict an upcoming abnormality in advance and quickly select a suitable detection point for the recovery operation, reducing the impact of system faults on the service. Because the triggering layer selects the optimal detection point that satisfies the preset recoverable condition, the system data after recovery is consistent with the historical real-time monitoring state before the abnormality occurred, which ensures data integrity. The recovery layer ensures that the system is in a normal running state after recovery, which enhances the stability and reliability of the system and improves service continuity. The real-time monitoring and recovery system enables automated operation and maintenance, lightens the workload of administrators, and improves operation and maintenance efficiency and response speed. In general, this embodiment can effectively ensure the stable operation of the intelligent computing cloud operating system, improve the reliability and recovery capability of the system, and safeguard service continuity and data integrity.
Illustratively, the following are specific examples of steps 301 through 304:
Step 301: dynamic detection layer
In the step, the system monitors the current state of the intelligent computing cloud operating system in real time, and acquires the dynamic monitoring state, wherein the dynamic monitoring state comprises information such as resource utilization rate, task completion condition, data transmission condition, system operation load and the like. This may be accomplished by various monitoring tools, sensors, or system logs.
In a specific example, the monitoring covers the following. Resource utilization monitoring: the system continuously monitors the utilization of resources such as CPU, memory, disk and network. Task completion monitoring: the system checks the execution of the tasks in the current task queue to ensure that they can be completed on time. Data transmission monitoring: the system monitors the data transmission speed and the transmission success rate to ensure the timeliness and integrity of the data. System operation load monitoring: the system evaluates its load, including request processing speed, number of concurrent connections and the like.
Step 302: prejudging layer
In this step, the system performs a pre-determination according to the dynamic monitoring state to select a target dynamic monitoring state in which an abnormality is about to occur.
In a specific example, if the system finds that CPU utilization stays above a threshold, it pre-judges that a performance problem or resource bottleneck may occur. If tasks in the task queue start to back up and the processing speed drops, it pre-judges that task delay or failure may occur. If the data transmission speed drops suddenly, it pre-judges that a network failure or data transmission congestion may occur. If the system operation load exceeds a preset threshold, it pre-judges that a system crash or service unavailability may occur.
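The first rule above, CPU utilization staying above a threshold, can be sketched with a sliding window. The class name `SustainedThresholdDetector`, the window length and the requirement that every sample in the window exceed the threshold are assumptions chosen for the example; the application does not specify how "continues to be above" is measured.

```python
from collections import deque


class SustainedThresholdDetector:
    """Flags a metric only when it exceeds a threshold for a full window of samples."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the most recent samples

    def update(self, value):
        """Record one sample and report whether the whole window is above the threshold."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)
```

Requiring a full window avoids reacting to a single short spike. The same detector could be applied to the operation load, and an inverted comparison would cover a sudden drop in transmission speed.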
Step 303: trigger layer
In this step, the system selects an optimal detection point matching the target dynamic monitoring state from among the target detection points, and ensures that the data stored in the selected detection point satisfies a preset recoverable condition.
In a specific example, according to the pre-judged abnormal condition, the system selects from the historical detection points the one closest to the target dynamic monitoring state, and ensures that the data stored in the selected detection point satisfies the recovery conditions, such as data consistency and availability.
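One possible reading of "closest to the target dynamic monitoring state" is a nearest-neighbour search over recorded state vectors, restricted to detection points that pass the recoverable condition. The `DetectionPoint` fields, the use of Euclidean distance and the two boolean flags are assumptions for this sketch; the application does not define the distance measure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionPoint:
    """Hypothetical record of a stored detection point."""
    point_id: str
    state: tuple       # monitoring state vector recorded when the point was created
    consistent: bool   # stored data passed the consistency check
    available: bool    # stored data can currently be read back


def select_optimal_point(points, target_state):
    """Return the recoverable point whose recorded state is nearest to the target."""
    recoverable = [p for p in points if p.consistent and p.available]
    if not recoverable:
        return None
    return min(recoverable, key=lambda p: math.dist(p.state, target_state))
```

Filtering before ranking means that a detection point with inconsistent or unavailable data is never chosen, however close its recorded state is.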
Step 304: recovery layer
In the step, the system performs recovery operation based on the selected optimal detection point, ensures that the intelligent computing cloud operating system is in a normal running state after recovery, and maintains data integrity.
In a specific example, if a performance problem is predicted, the system may choose to revert to a historical state with better performance, for example one recorded when the load or the resource utilization was low. If a task delay or failure is predicted, the system may revert to a historical state in which tasks completed successfully, for example when the task queue was empty or the task success rate was high. If a network failure or data transmission congestion is predicted, the system may revert to a historical state in which the network was unobstructed, for example when the data transmission speed was normal and there was no congestion. If a system crash or service unavailability is predicted, the system may revert to a historical state in which the system ran stably, for example when the system load was normal and the service was available.
Through the above example steps, the system can predict abnormalities in time and perform recovery operations, ensuring the normal operation and data integrity of the intelligent computing cloud operating system.
In the embodiment of the application, the intelligent check point recovery method realizes intelligent system recovery operation by combining the historical system monitoring data, the check point prediction model and the real-time monitoring model, improves the stability, the reliability and the data integrity of the system, ensures that the recovered intelligent computing cloud operating system is in a normal running state and keeps the data integrity.
In yet another embodiment of the present application, there is also provided an intelligent computing cloud operating system, the intelligent computing cloud operating system being an operating system adapted to a cloud computing environment; as described with reference to fig. 3, the intelligent computing cloud operating system includes:
a monitoring unit configured to obtain historical system monitoring data of the intelligent computing cloud operating system; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency;
The creating unit is configured to carry out predictive analysis on the historical system monitoring data through a detection point prediction model and create a target detection point matched with the intelligent computing cloud operating system based on an analysis result; the detection point prediction model is used for comprehensively constructing a target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area; the configuration information of the target detection point after the establishment is dynamically adjusted along with the change of the running state of the real-time task in the intelligent computing cloud operating system;
The recovery unit is configured to monitor the current system state of the intelligent computing cloud operating system in real time through a real-time monitoring model, trigger a corresponding target detection point based on a real-time monitoring result, and perform recovery operation on the intelligent computing cloud operating system.
Further optionally, the detection point prediction model comprises: a strategy prediction layer, a detection probe layer and a detection creation layer;
the detection unit, when performing predictive analysis on the historical system monitoring data through the detection point prediction model and creating a target detection point matched with the intelligent computing cloud operating system based on the analysis result, is configured to:
performing predictive analysis on the historical system monitoring data through a strategy prediction layer to obtain a detection point creation strategy of a detection point; the detection point creation strategy at least comprises: creating conditions, creating opportunities and storing space positions;
judging whether the creation conditions and/or the creation time set in the detection point creation strategy are reached or not through the detection probe layer;
and through a detection creation layer, after the creation condition and/or the creation time set in the detection point creation strategy are reached, constructing a corresponding target detection point based on the real-time monitoring data, and storing the target detection point into a storage space position set in the detection point creation strategy.
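The division of work between the detection probe layer (checking the creation condition and/or creation timing) and the detection creation layer (building and storing the detection point) can be sketched as follows. `CreationPolicy`, its fields, the fixed-interval interpretation of the creation timing and the dictionary used as a store are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CreationPolicy:
    """Hypothetical detection point creation strategy."""
    condition: Callable[[dict], bool]   # creation condition on real-time metrics
    interval_s: float                   # creation timing, here a fixed interval
    storage_location: str               # storage space location for new points


def should_create(policy, metrics, now, last_created):
    """Probe layer: true when the condition holds or the interval has elapsed."""
    return policy.condition(metrics) or (now - last_created) >= policy.interval_s


def create_point(policy, metrics, now, store):
    """Creation layer: snapshot the real-time data and store it at the policy's location."""
    point = {"created_at": now, "snapshot": dict(metrics)}
    store.setdefault(policy.storage_location, []).append(point)
    return point
```

The `or` in `should_create` reflects the "and/or" wording; a stricter variant could require both the condition and the timing.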
Further optionally, the detecting unit performs prediction analysis on the historical system monitoring data through a policy prediction layer to obtain a detection point creating policy of the detection point, and is configured to:
extracting data storage behavior characteristics corresponding to each local network area in the intelligent computing cloud operating system from the historical system monitoring data through a first characteristic extraction module; the data storage behavior feature comprises at least: data throughput characteristics, access frequency characteristics and data redundancy requirement characteristics corresponding to each local network area;
Extracting data security features corresponding to each local network area in the intelligent computing cloud operating system from the historical system monitoring data through a second feature extraction module; the data security features include at least: data encryption characteristics and access control characteristics corresponding to each local network area;
Comprehensively predicting the data storage behavior characteristics and the data security characteristics corresponding to each local network area through a backup demand prediction module so as to obtain data backup demand information corresponding to each local network area; the data backup requirement information at least comprises: storage requirements, security requirements, task processing objectives, and data storage media types;
And generating a data backup strategy corresponding to each local network area based on the data backup demand information, and comprehensively constructing the detection point creation strategy from these data backup strategies.
Further optionally, the detecting unit generates a data backup policy corresponding to each local network area based on the data backup requirement information, and after comprehensively constructing the creating policy for the detection point, is further configured to:
Acquiring historical reliability data of storage space positions in the detection point creation strategy through a reliability evaluation module; the historical reliability data includes at least: the historical failure rate, the historical data loss rate, the historical recovery time and the available recovery mechanism of the storage space position;
Based on the historical reliability data, performing reliability evaluation processing on the storage space position in the detection point creation strategy to obtain a corresponding reliability matrix;
Based on a preset dynamic risk affordable matrix, performing reliability screening on elements in the reliability matrix to obtain corresponding target elements; the target element accords with a preset reliability evaluation qualification condition;
And outputting the target element as a storage space position qualified in reliability evaluation.
Further optionally, the reliability prediction scores of the elements in the reliability matrix represent reliability evaluation scores corresponding to the storage space positions corresponding to the detection point creation strategies under the reliability evaluation dimension;
wherein the reliability prediction score of element i in the reliability matrix is calculated by a formula whose image, together with its symbols, is not reproduced in this text; the quantities entering the formula are:
the historical failure rate of the i-th storage space location;
the historical data loss rate of the i-th storage space location;
the historical recovery time of the i-th storage space location;
the available recovery mechanism data value of the i-th storage space location;
the four weight coefficients corresponding respectively to the above four dimensions;
the reliability evaluation score of the i-th storage space location under the k-th external system factor dimension;
the weight coefficient corresponding to the k-th external system factor; the external system factors comprise at least: traffic, data volume, geographic location where the storage device is located, network connection quality, network provider reputation.
Further optionally, the target element meets a preset reliability evaluation qualification condition that:
the reliability prediction score of element i in the reliability matrix is higher than the corresponding reference value in the dynamic risk affordable matrix;
and dynamically adjusting the corresponding reference value in the dynamic risk affordable matrix along with the change trend of the actual storage scheme of the intelligent computing cloud operating system.
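The scoring formula itself is not available in this text, so the sketch below uses an assumed weighted-sum form purely to illustrate how a score built from the listed quantities could be screened against the dynamic risk affordable matrix. The function names, the normalization of every input to the range 0 to 1, and the choice to reward low failure rate, low loss rate and short recovery time are all assumptions, not the application's formula.

```python
def reliability_score(failure_rate, loss_rate, recovery_time_norm, mechanism_value,
                      weights, external_scores, external_weights):
    """Assumed weighted-sum score; NOT the formula of the application.

    All inputs are assumed to be normalized to [0, 1]. Lower failure rate,
    loss rate and recovery time raise the score.
    """
    w_f, w_l, w_t, w_m = weights
    internal = (w_f * (1 - failure_rate)
                + w_l * (1 - loss_rate)
                + w_t * (1 - recovery_time_norm)
                + w_m * mechanism_value)
    # One term per external system factor k (traffic, data volume, and so on).
    external = sum(w * s for w, s in zip(external_weights, external_scores))
    return internal + external


def screen(scores, reference):
    """Keep locations whose score is higher than their reference value."""
    return [loc for loc, score in scores.items() if score > reference[loc]]
```

Because the reference values are adjusted dynamically, `screen` would be re-run whenever the `reference` mapping changes, so a location that qualified earlier may later be excluded.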
Further optionally, the detecting unit, based on a preset dynamic risk affordable matrix, performs reliability screening on elements in the reliability matrix to obtain corresponding target elements, and is further configured to:
arranging the corresponding storage space positions in the target elements according to the reliability prediction values from high to low;
taking the storage space locations ranked within a first preset range as the first-priority target storage space locations in the detection point creation strategy;
taking the storage space locations ranked within a second preset range as the second-priority backup storage space locations in the detection point creation strategy;
taking the storage space locations ranked within a third preset range as the third-priority extended storage space locations in the detection point creation strategy;
wherein the probability of being selected decreases from the first priority to the second priority to the third priority; the number of rank positions preset for each priority level is set flexibly according to the data volume of the intelligent computing cloud operating system.
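The ranking and tiering described above amounts to sorting the qualified locations by score and slicing the sorted list. In this sketch the function name `assign_priority_tiers`, the tier labels and the three size parameters are assumptions for illustration.

```python
def assign_priority_tiers(scores, first_n, second_n, third_n):
    """Rank locations by score (high to low) and split them into three tiers."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    second_end = first_n + second_n
    return {
        "target": ranked[:first_n],                          # first priority
        "backup": ranked[first_n:second_end],                # second priority
        "extended": ranked[second_end:second_end + third_n], # third priority
    }
```

The three size parameters correspond to the per-level rank counts, which the text says are set according to the data volume of the system; locations ranked below the third tier are simply left out.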
Further optionally, the real-time monitoring model at least includes: the device comprises a dynamic detection layer, a pre-judging layer, a triggering layer and a recovery layer;
the recovery unit, when monitoring the current system state of the intelligent computing cloud operating system in real time through the real-time monitoring model, triggering a corresponding target detection point based on the real-time monitoring result, and performing a recovery operation on the intelligent computing cloud operating system, is configured to:
The current system state of the intelligent computing cloud operating system is monitored in real time through a dynamic detection layer, so that a dynamic monitoring state is obtained; the dynamic monitoring state at least comprises: resource utilization rate, task completion condition, data transmission condition and system operation load;
pre-judging the dynamic monitoring state through the pre-judging layer to select a target dynamic monitoring state in which an abnormality is about to occur;
selecting an optimal detection point matched with the target dynamic monitoring state from the target detection points through a triggering layer; the data consistency between the recoverable data stored in the optimal detection point and the historical real-time monitoring state at the target time before the occurrence of the abnormality meets the preset recoverable condition;
And carrying out recovery operation on the intelligent computing cloud operating system based on the optimal detection point through a recovery layer so as to ensure that the recovered intelligent computing cloud operating system is in a normal running state and keep data integrity.
In the embodiment of the application, the intelligent check point recovery device realizes intelligent system recovery operation by combining the historical system monitoring data, the check point prediction model and the real-time monitoring model, improves the stability, the reliability and the data integrity of the system, ensures that the recovered intelligent computing cloud operating system is in a normal running state and keeps the data integrity.
In yet another embodiment of the present application, there is also provided an intelligent computing platform, including: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
A memory for storing a computer program;
and the processor is used for realizing the intelligent check point recovery method according to the embodiment of the method when executing the program stored in the memory.
The communication bus 1140 of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like.
Illustratively, it is assumed that a large-scale, autonomously controllable intelligent computing platform based on a neural network dedicated chip needs to be built for providing a hardware basis for developing and building the intelligent computing platform. Meanwhile, the intelligent computing platform can also provide a hardware foundation for the construction of an intelligent supercomputer center, and the construction of the center can be used for artificial intelligent platforms for scientific research, industry and urban service, and gathering talents and developing industry.
Specifically, the intelligent computing platform mainly comprises: an intelligent hardware platform, an intelligent computing cloud operating system, application environment development, a big data platform and an intelligent application PaaS platform. In the intelligent hardware platform, based on intelligent computing theory, deep learning chips, AI accelerator cards and distributed servers can be integrated, so that basic hardware support is provided for the whole supercomputing platform and its derivative platforms. The intelligent hardware platform mainly comprises the following four parts: an intelligent computing subsystem, a data storage subsystem, an intelligent computing cloud operating system and a support management subsystem.
The embodiment of the application provides an intelligent check point recovery method for constructing a low-energy-consumption arithmetic unit.
For ease of illustration, the bus is represented by only one thick line in fig. 3, but this does not mean that there is only one bus or only one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices described above.
The memory 1130 may include random access memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor 1110 may be any of various special-purpose processors, including a graphics processing unit (GPU), a machine learning unit (MLU), a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, where the computer program is executed to implement the steps executable by the electronic device in the above method embodiments.

Claims (10)

1. The intelligent check point recovery method is characterized by being applied to an intelligent computing cloud operating system, wherein the intelligent computing cloud operating system is an operating system which is adaptive to a cloud computing environment; the intelligent check point recovery method comprises the following steps:
Acquiring historical system monitoring data of the intelligent computing cloud operating system; the historical system monitoring data at least comprises: historical system running state, historical system load condition and historical fault occurrence frequency;
Performing predictive analysis on the historical system monitoring data through a detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result; the detection point prediction model is used for comprehensively constructing a target detection point which is suitable for the whole intelligent computing cloud operating system based on multidimensional cross prediction analysis of each local network area; the configuration information of the target detection point after the establishment is dynamically adjusted along with the change of the running state of the real-time task in the intelligent computing cloud operating system;
And carrying out real-time monitoring on the current system state of the intelligent computing cloud operating system through a real-time monitoring model, triggering a corresponding target detection point based on a real-time monitoring result, and carrying out recovery operation on the intelligent computing cloud operating system.
2. The intelligent checkpoint recovery method of claim 1, wherein the detection point prediction model comprises: a strategy prediction layer, a detection probe layer and a detection creation layer;
the method for predicting and analyzing the historical system monitoring data through the detection point prediction model, and creating a target detection point matched with the intelligent computing cloud operating system based on an analysis result comprises the following steps:
performing predictive analysis on the historical system monitoring data through a strategy prediction layer to obtain a detection point creation strategy of a detection point; the detection point creation strategy at least comprises: creating conditions, creating opportunities and storing space positions;
judging whether the creation conditions and/or the creation time set in the detection point creation strategy are reached or not through the detection probe layer;
and through a detection creation layer, after the creation condition and/or the creation time set in the detection point creation strategy are reached, constructing a corresponding target detection point based on the real-time monitoring data, and storing the target detection point into a storage space position set in the detection point creation strategy.
3. The intelligent checkpoint recovery method according to claim 2, wherein performing predictive analysis on the historical system monitoring data through the strategy prediction layer to obtain the detection point creation strategy comprises:
extracting, through a first feature extraction module, data storage behavior features corresponding to each local network area in the intelligent computing cloud operating system from the historical system monitoring data; the data storage behavior features comprise at least: data throughput features, access frequency features and data redundancy requirement features corresponding to each local network area;
extracting, through a second feature extraction module, data security features corresponding to each local network area in the intelligent computing cloud operating system from the historical system monitoring data; the data security features comprise at least: data encryption features and access control features corresponding to each local network area;
performing, through a backup requirement prediction module, comprehensive prediction on the data storage behavior features and the data security features corresponding to each local network area to obtain data backup requirement information corresponding to each local network area; the data backup requirement information comprises at least: storage requirements, security requirements, task processing objectives and data storage media types;
and generating a data backup strategy corresponding to each local network area based on the data backup requirement information, and combining the data backup strategies to construct the detection point creation strategy.
4. The intelligent checkpoint recovery method according to claim 3, wherein generating a data backup strategy corresponding to each local network area based on the data backup requirement information, and combining the data backup strategies to construct the detection point creation strategy, further comprises:
acquiring, through a reliability evaluation module, historical reliability data of the storage space positions in the detection point creation strategy; the historical reliability data comprises at least: the historical failure rate, the historical data loss rate, the historical recovery time and the available recovery mechanisms of each storage space position;
performing, based on the historical reliability data, reliability evaluation processing on the storage space positions in the detection point creation strategy to obtain a corresponding reliability matrix;
performing, based on a preset dynamic risk affordable matrix, reliability screening on the elements in the reliability matrix to obtain corresponding target elements; the target elements meet a preset reliability evaluation qualification condition;
and outputting the target elements as the storage space positions that pass the reliability evaluation.
5. The intelligent checkpoint recovery method according to claim 4, wherein the reliability prediction score of each element in the reliability matrix represents the reliability evaluation score, under the reliability evaluation dimension, of the corresponding storage space position in the detection point creation strategy;
wherein the reliability prediction score of element i in the reliability matrix is calculated by the following formula:
[formula not preserved in this text]
wherein the quantities in the formula are: the historical failure rate of the i-th storage space position; the historical data loss rate of the i-th storage space position; the historical recovery time of the i-th storage space position; the available recovery mechanism data value of the i-th storage space position; four weight coefficients, one for each of these four dimensions; the reliability evaluation score of the i-th storage space position under the k-th external system factor dimension; and the weight coefficient corresponding to the k-th external system factor. The external system factors comprise at least: traffic, data volume, the geographic location of the storage device, network connection quality and network provider reputation.
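Because the formula itself is not available in this text, the following is only one plausible reading of the quantities listed in this claim: a weighted sum over the four historical dimensions plus a weighted sum over the external system factor scores. The function name `reliability_score`, the additive form, and the assumption that every input has already been normalized so that a higher value means a more reliable position are all illustrative choices, not the patented formula.

```python
from typing import Sequence


def reliability_score(dims: Sequence[float], dim_weights: Sequence[float],
                      ext_scores: Sequence[float],
                      ext_weights: Sequence[float]) -> float:
    """Assumed additive score for one storage space position.

    dims: normalized values for failure rate, data loss rate, recovery time
          and available recovery mechanism (assumed: higher is better).
    ext_scores: scores under each external system factor dimension.
    """
    if len(dims) != 4 or len(dim_weights) != 4:
        raise ValueError("expected exactly four historical dimensions and weights")
    if len(ext_scores) != len(ext_weights):
        raise ValueError("each external factor score needs one weight")
    internal = sum(w * x for w, x in zip(dim_weights, dims))
    external = sum(v * e for v, e in zip(ext_weights, ext_scores))
    return internal + external


score = reliability_score(
    dims=[0.9, 0.8, 0.7, 1.0],
    dim_weights=[0.4, 0.3, 0.2, 0.1],
    ext_scores=[0.6, 0.9],   # e.g. network connection quality, provider reputation
    ext_weights=[0.5, 0.5],
)
```

If raw failure or loss rates were used directly, where a lower value is better, they would first have to be inverted or given negative weights; the claim text as extracted does not say which convention applies.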
6. The intelligent checkpoint recovery method according to claim 5, wherein the target element meeting the preset reliability evaluation qualification condition means that:
the reliability prediction score of element i in the reliability matrix is higher than the corresponding reference value in the dynamic risk affordable matrix;
and the corresponding reference value in the dynamic risk affordable matrix is dynamically adjusted following the change trend of the actual storage scheme of the intelligent computing cloud operating system.
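The qualification condition can be illustrated with a small screening function. As a simplifying assumption, both the reliability matrix and the dynamic risk affordable matrix are represented here as dictionaries keyed by a hypothetical storage position name; the function name `screen_locations` is likewise introduced only for this sketch.

```python
from typing import Dict


def screen_locations(scores: Dict[str, float],
                     reference: Dict[str, float]) -> Dict[str, float]:
    """Keep positions whose score is strictly higher than their reference value.

    Raises KeyError if a position has no reference value.
    """
    return {loc: s for loc, s in scores.items() if s > reference[loc]}


scores = {"zone-a": 0.82, "zone-b": 0.55}
reference = {"zone-a": 0.70, "zone-b": 0.60}
qualified = screen_locations(scores, reference)   # only "zone-a" qualifies

# Dynamic adjustment: when the reference value changes, the same score may
# no longer qualify.
reference["zone-a"] = 0.90
qualified_after = screen_locations(scores, reference)
```

How the reference values are updated from the storage scheme trend is not specified in the claim, so the sketch only shows the effect of a changed reference value on the screening result.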
7. The intelligent checkpoint recovery method according to claim 4, wherein after performing reliability screening on the elements in the reliability matrix based on the preset dynamic risk affordable matrix to obtain the corresponding target elements, the method further comprises:
ranking the storage space positions corresponding to the target elements by reliability prediction score from high to low;
taking the storage space positions at the first preset ranks as the first-priority target storage space positions in the detection point creation strategy;
taking the storage space positions at the second preset ranks as the second-priority backup storage space positions in the detection point creation strategy;
taking the storage space positions at the third preset ranks as the third-priority extended storage space positions in the detection point creation strategy;
wherein the probability of being selected decreases from the first priority to the second priority to the third priority; and the number of preset ranks at each priority level is set flexibly according to the data volume of the intelligent computing cloud operating system.
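The ranking and three-level assignment can be sketched as below. The function name `assign_tiers`, the tier labels, and the idea of passing the number of ranks per level as plain integers are assumptions made for illustration; the claim only states that these counts depend on the data volume of the system.

```python
from typing import Dict, List


def assign_tiers(scores: Dict[str, float], n_target: int, n_backup: int,
                 n_extended: int) -> Dict[str, List[str]]:
    """Rank qualified positions by score (high to low) and split them into
    first-priority target, second-priority backup and third-priority
    extended storage space positions."""
    ranked = sorted(scores, key=lambda loc: scores[loc], reverse=True)
    backup_end = n_target + n_backup
    return {
        "target": ranked[:n_target],
        "backup": ranked[n_target:backup_end],
        "extended": ranked[backup_end:backup_end + n_extended],
    }


tiers = assign_tiers({"a": 0.9, "b": 0.7, "c": 0.8, "d": 0.5},
                     n_target=1, n_backup=1, n_extended=2)
# tiers == {"target": ["a"], "backup": ["c"], "extended": ["b", "d"]}
```

Positions ranked below the last extended slot are simply left out of the strategy in this sketch.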
8. The intelligent checkpoint recovery method according to claim 1, wherein the real-time monitoring model comprises at least: a dynamic detection layer, a pre-judging layer, a triggering layer and a recovery layer;
monitoring the current system state of the intelligent computing cloud operating system in real time through the real-time monitoring model, triggering a corresponding target detection point based on the real-time monitoring result, and performing a recovery operation on the intelligent computing cloud operating system comprises:
monitoring the current system state of the intelligent computing cloud operating system in real time through the dynamic detection layer to obtain a dynamic monitoring state; the dynamic monitoring state comprises at least: resource utilization, task completion status, data transmission status and system operating load;
pre-judging the dynamic monitoring state through the pre-judging layer to select a target dynamic monitoring state in which an abnormality is imminent;
selecting, through the triggering layer, an optimal detection point matched with the target dynamic monitoring state from the target detection points; the data consistency between the recoverable data stored in the optimal detection point and the historical real-time monitoring state at the target time before the occurrence of the abnormality meets a preset recoverable condition;
and performing, through the recovery layer, a recovery operation on the intelligent computing cloud operating system based on the optimal detection point, so that the recovered intelligent computing cloud operating system is in a normal running state and data integrity is preserved.
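The triggering-layer selection can be illustrated with a simple consistency measure. The claim does not define how data consistency is computed or what the preset recoverable condition is, so the sketch below assumes a hypothetical metric (the fraction of state fields that match within a tolerance) and a hypothetical minimum threshold; `Checkpoint`, `consistency` and `select_optimal` are names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    created_at: float
    state: Dict[str, float]   # recoverable data stored in the detection point


def consistency(a: Dict[str, float], b: Dict[str, float],
                tol: float = 1e-6) -> float:
    """Assumed metric: share of fields present in both states with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    matches = sum(1 for k in keys
                  if k in a and k in b and abs(a[k] - b[k]) <= tol)
    return matches / len(keys)


def select_optimal(checkpoints: List[Checkpoint],
                   reference_state: Dict[str, float],
                   anomaly_time: float,
                   min_consistency: float) -> Optional[Checkpoint]:
    """Pick the most consistent detection point created before the anomaly;
    ties go to the most recent one. Returns None if none qualifies."""
    qualified = [c for c in checkpoints
                 if c.created_at < anomaly_time
                 and consistency(c.state, reference_state) >= min_consistency]
    if not qualified:
        return None
    return max(qualified,
               key=lambda c: (consistency(c.state, reference_state),
                              c.created_at))
```

Here `reference_state` stands for the historical real-time monitoring state at the target time before the abnormality. Returning `None` when nothing qualifies leaves the fallback behavior to the caller, since the claim does not describe it.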
9. An intelligent computing cloud operating system, wherein the intelligent computing cloud operating system is an operating system adapted to a cloud computing environment; the intelligent computing cloud operating system comprises:
a monitoring unit configured to obtain historical system monitoring data of the intelligent computing cloud operating system; the historical system monitoring data comprises at least: historical system running state, historical system load and historical fault occurrence frequency;
a creating unit configured to perform predictive analysis on the historical system monitoring data through a detection point prediction model, and to create a target detection point matched with the intelligent computing cloud operating system based on the analysis result; the detection point prediction model is used for constructing, based on multidimensional cross predictive analysis of each local network area, a target detection point suited to the intelligent computing cloud operating system as a whole; after creation, the configuration information of the target detection point is dynamically adjusted as the running state of real-time tasks in the intelligent computing cloud operating system changes;
a recovery unit configured to monitor the current system state of the intelligent computing cloud operating system in real time through a real-time monitoring model, to trigger a corresponding target detection point based on the real-time monitoring result, and to perform a recovery operation on the intelligent computing cloud operating system.
10. An intelligent computing platform, the intelligent computing platform comprising:
at least one processor, a memory, and an input/output unit;
wherein the memory is used for storing a computer program, and the processor is used for invoking the computer program stored in the memory to perform the intelligent checkpoint recovery method according to any one of claims 1-8.
CN202410431888.8A 2024-04-11 2024-04-11 Intelligent check point recovery method, cloud operating system and computing platform Active CN118051374B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410431888.8A CN118051374B (en) 2024-04-11 2024-04-11 Intelligent check point recovery method, cloud operating system and computing platform

Publications (2)

Publication Number Publication Date
CN118051374A true CN118051374A (en) 2024-05-17
CN118051374B CN118051374B (en) 2024-08-06

Family

ID=91052133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410431888.8A Active CN118051374B (en) 2024-04-11 2024-04-11 Intelligent check point recovery method, cloud operating system and computing platform

Country Status (1)

Country Link
CN (1) CN118051374B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143142A (en) * 2019-12-26 2020-05-12 江南大学 Universal check point and rollback recovery method
CN114518974A (en) * 2022-02-21 2022-05-20 中国农业银行股份有限公司 Checkpoint recovery method, device, equipment and medium for data processing task
CN116361060A (en) * 2023-05-25 2023-06-30 中国地质大学(北京) Multi-feature-aware stream computing system fault tolerance method and system
CN117540321A (en) * 2023-12-04 2024-02-09 国家电网有限公司大数据中心 Data service monitoring system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118331779A (en) * 2024-06-12 2024-07-12 广东琴智科技研究院有限公司 Distributed system fault judging and recovering method, cloud operating system and computing platform applying method
CN118331779B (en) * 2024-06-12 2024-09-10 广东琴智科技研究院有限公司 Distributed system fault judging and recovering method, cloud operating system and computing platform applying method

Also Published As

Publication number Publication date
CN118051374B (en) 2024-08-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant