CN118012662B - Distributed fault restoration method, intelligent computing cloud operating system and computing platform - Google Patents

Info

Publication number
CN118012662B
Authority
CN
China
Prior art keywords
real
operation data
abnormal
time operation
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410416084.0A
Other languages
Chinese (zh)
Other versions
CN118012662A (en
Inventor
邓练兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Qinzhi Technology Research Institute Co ltd
Original Assignee
Guangdong Qinzhi Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Qinzhi Technology Research Institute Co ltd filed Critical Guangdong Qinzhi Technology Research Institute Co ltd
Priority to CN202410416084.0A priority Critical patent/CN118012662B/en
Publication of CN118012662A publication Critical patent/CN118012662A/en
Application granted granted Critical
Publication of CN118012662B publication Critical patent/CN118012662B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The application belongs to the field of data processing, and in particular relates to a distributed fault repair method, an intelligent computing cloud operating system and a computing platform. The method comprises the following steps: acquiring real-time operation data of an intelligent computing cloud operating system; performing stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams; inputting the plurality of real-time operation data streams in parallel into the branch modules of a streaming scalable anomaly detection model, each branch module performing dynamic fuzzy detection processing on its corresponding real-time operation data stream, to obtain the abnormal operation data streams among the plurality of real-time operation data streams; and summarizing the abnormal operation data streams through a dynamic fault repair model and dynamically repairing the intelligent computing cloud operating system. The method uses the streaming scalable anomaly detection model to process real-time operation data in parallel, achieving real-time anomaly detection and dynamic fault repair and thereby enhancing the stability and reliability of the system.

Description

Distributed fault restoration method, intelligent computing cloud operating system and computing platform
Technical Field
The application belongs to the field of data processing, and particularly relates to a distributed fault restoration method, an intelligent computing cloud operating system and a computing platform.
Background
To promote the adoption of intelligent applications across industries and fields, there is an urgent need to build an intelligent computing platform and an auxiliary intelligent supercomputing center, which provide artificial intelligence infrastructure for scientific research, industry and urban services and in turn support talent aggregation, industrial upgrading and development. Application containerization is a technique that packages an application and all of its dependencies into a separate, portable container. Bundling applications, libraries, configuration files and other dependencies together ensures consistent operation across environments, improves deployment efficiency, portability and flexibility, and lets developers manage and deploy applications more easily.
In the related art, real-time streaming data often comes in a variety of formats and types, and its structure may change dynamically. Traditional systems may process such diverse, unstructured data inefficiently, with limited processing capacity and slow response, and therefore fail to meet the real-time requirements of fault location. Therefore, there is a need for a distributed fault repair scheme that solves at least one of the above problems.
Disclosure of Invention
The application provides a distributed fault repair method, an intelligent computing cloud operating system and a computing platform, which process real-time operation data in parallel using a streaming scalable anomaly detection model to achieve real-time anomaly detection and dynamic fault repair, thereby enhancing the adaptive capacity, stability and reliability of the system.
In a first aspect, the present application provides a distributed fault repair method applied to an intelligent computing cloud operating system, where the intelligent computing cloud operating system is used for running and managing intelligent platform resources; the distributed fault repair method comprises the following steps:
acquiring real-time operation data of the intelligent computing cloud operating system, the real-time operation data at least comprising: real-time node load information, node availability information and data access patterns;
performing stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data;
inputting the plurality of real-time operation data streams in parallel into the branch modules of a streaming scalable anomaly detection model, the plurality of branch modules each performing dynamic fuzzy detection processing on the corresponding real-time operation data stream, to obtain the abnormal operation data streams among the plurality of real-time operation data streams, wherein the plurality of branch modules are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system;
and summarizing the abnormal operation data streams through a dynamic fault repair model, and dynamically repairing the intelligent computing cloud operating system.
In a second aspect, an embodiment of the present application provides an intelligent computing cloud operating system, where the intelligent computing cloud operating system is configured to run and manage intelligent platform resources; the intelligent computing cloud operating system includes:
an acquisition unit configured to acquire real-time operation data of the intelligent computing cloud operating system, the real-time operation data at least comprising: real-time node load information, node availability information and data access patterns;
a stream segmentation unit configured to perform stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data;
a detection unit configured to input the plurality of real-time operation data streams in parallel into the branch modules of a streaming scalable anomaly detection model, the plurality of branch modules each performing dynamic fuzzy detection processing on the corresponding real-time operation data stream, so as to obtain the abnormal operation data streams among the plurality of real-time operation data streams, wherein the plurality of branch modules are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system;
and a repair unit configured to summarize the abnormal operation data streams through a dynamic fault repair model and dynamically repair the intelligent computing cloud operating system.
In a third aspect, an embodiment of the present application provides an intelligent computing platform, including:
at least one processor, a memory, and an input/output unit;
wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program stored in the memory to perform the distributed fault repair method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided that includes instructions which, when executed on a computer, cause the computer to perform the distributed fault repair method of the first aspect.
The technical scheme provided by the embodiment of the application can be applied to an intelligent computing cloud operating system, which is used for running and managing intelligent platform resources. In the scheme, the real-time operation data of the intelligent computing cloud operating system is first acquired. The real-time operation data at least comprises: real-time node load information, node availability information, and data access patterns. Stream segmentation processing is then performed on the real-time operation data to obtain a plurality of real-time operation data streams. Next, the plurality of real-time operation data streams are input in parallel into the branch modules of the streaming scalable anomaly detection model, and each branch module performs dynamic fuzzy detection processing on its corresponding stream to obtain the abnormal operation data streams among them. Segmenting the real-time data into streams supports parallel processing, which improves data processing efficiency and strengthens the system's ability to handle large-scale data. Because the branch modules deployed on different nodes perform dynamic fuzzy detection on the data streams, the accuracy and sensitivity of anomaly detection are improved.
The branch modules of the streaming scalable anomaly detection model are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system; this distributed architecture makes effective use of system resources, balances load and increases processing speed. Finally, the abnormal operation data streams are summarized through a dynamic fault repair model, and the intelligent computing cloud operating system is dynamically repaired. Because the abnormal data streams are summarized before the dynamic fault repair model acts on them, the repair process takes the distributed nature of the system into account, so that faults on different nodes and services can be repaired accurately and efficiently, and service continuity and data integrity can be maintained even under complex and changing faults.
With this technical scheme, operation data is acquired and processed in real time, so anomalies and faults in the system can be responded to and handled quickly, reducing system downtime. Performing anomaly detection on the distributed architecture of the streaming scalable anomaly detection model makes full use of system resources, improves processing efficiency and shortens response time. Dynamic fault repair selects the most appropriate repair measures according to the current state of the system and the nature of the fault, improving the repair capability and flexibility of the system. In summary, real-time operation data is processed in parallel through the streaming scalable anomaly detection model to achieve real-time anomaly detection and dynamic fault repair, enhancing the adaptive capacity, stability and reliability of the system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a distributed fault repair method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an intelligent computing cloud operating system according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Cloud computing is an emerging computing model that provides on-demand computing resources and services over a network. Its core idea is to distribute computing tasks over a resource pool made up of a large number of computers, so that applications can obtain computing power, storage space and software services as needed. Intelligent computing is a technology that simulates human intelligence: by imitating human ways of thinking and learning, it lets a computer complete complex tasks automatically. Resource management technology concerns how to allocate and schedule system resources efficiently to meet user demands.
In the related art, real-time streaming data generally covers various formats and types, and its structure may change dynamically over time. Conventional systems tend to handle such diverse, unstructured data inefficiently and have difficulty accommodating rapid changes in data format. This dynamism and diversity make data processing difficult because conventional systems often rely on static data models and predefined data structures. In practice, however, the format and structure of the data may change frequently as source data changes, business requirements are adjusted, or the external environment shifts. Conventional systems are therefore limited in how well they accommodate such changes, which may lead to inefficient and unstable data processing with inadequate processing capacity.
Therefore, there is a need to design a distributed fault repair scheme to solve at least one of the above-mentioned problems.
The embodiment of the application provides a distributed fault restoration method, an intelligent computing cloud operating system and a computing platform.
The distributed fault repair scheme provided by the embodiment of the application can be executed by a chip. The chip described herein may be a general-purpose processor, including an artificial intelligence processor, a graphics processing unit (GPU), an artificial intelligence processor card (Machine Learning Unit, MLU), a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, the artificial intelligence chip and accelerator card design may use high-performance MLUs as the basic module of the intelligent platform. The MLU is a high-performance, low-power artificial intelligence processor card that adopts the latest architecture: its equivalent theoretical peak speed can reach 128 trillion fixed-point operations per second, its typical board-level power consumption is only 80 watts, and its peak power consumption does not exceed 110 watts. High-performance artificial intelligence servers can be built modularly on the MLU to handle different intelligent application loads flexibly.
The distributed fault repair scheme provided by the embodiment of the application can also be executed by an electronic device, which may be a server, a server cluster or a cloud server. The electronic device may also be a terminal device such as a mobile phone, computer, tablet or wearable device, or a dedicated device (e.g., a dedicated terminal device with a distributed fault repair system). The chips described in the above embodiments may be mounted on these electronic devices. Alternatively, a service program for performing the distributed fault repair scheme may be installed on the electronic devices.
In the embodiment of the application, the intelligent computing cloud operating system is mainly responsible for storing various related data such as input data, computing results, observation data, visual data and the like of the advanced computing platform. The data may be from different applications and require unified management and storage for subsequent analysis and processing.
FIG. 1 is a flow chart of a distributed fault repair method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
101, Acquiring real-time operation data of the intelligent computing cloud operating system.
In the embodiment of the application, the intelligent computing cloud operating system is used for running and managing intelligent platform resources, which at least comprise computing resources, storage resources and network resources. Computing resources are the hardware resources used to perform computing tasks, including artificial intelligence chips, artificial intelligence boards, central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and the like; they may also be virtualized resources such as virtual machines and containers. These computing resources carry out computationally intensive tasks such as algorithm execution, model training and inference. Storage resources are the hardware resources used for data storage and management, covering a variety of storage media and devices such as hard disk drives (HDDs), solid-state drives (SSDs), network-attached storage (NAS) and object storage. They store data sets, model parameters, logs and similar information. Network resources are the network devices and bandwidth used for connection and communication, including Ethernet switches, routers, fiber-optic communication devices, and the like. They enable communication between components within the intelligent platform and data exchange between the intelligent platform and external systems.
In addition to computing, storage and network resources, the intelligent platform resources can be extended to the following categories:
Sensor resources: if the intelligent platform is involved in sensing and acquisition tasks, it may also include sensor resources such as cameras, sound sensors and temperature sensors. The sensors collect environmental information, images, sound and other data and provide input for the intelligent system.
Edge computing resources: as edge computing develops, the intelligent platform resources may further include computing, storage and network resources distributed at edge nodes, used to perform part of the computing tasks, store data or respond in real time at the edge.
Security resources: the intelligent platform resources also need to account for security and may include encryption modules, secure storage devices, access control devices, and the like, which protect the security and privacy of systems and data.
In summary, the intelligent platform resources cover not only computing, storage and network resources but also sensor resources, edge computing resources, security resources and the like. Together these resources form the infrastructure of the intelligent computing cloud operating system and support various intelligent applications.
In the present application, the real-time operation data at least includes: real-time node load information, node availability information, and data access patterns. Real-time node load information is the current load of each node, including CPU utilization, memory occupancy, network bandwidth utilization and the like. It helps the system monitor the working state of a node and judge whether the node is overloaded or its resources are underutilized. Node availability information is the availability status of each node, that is, whether the node is currently available to process tasks. It may include the online/offline status of the node, fault reports, error logs and the like. By monitoring node availability information, the system can discover node faults or unavailability in time. The data access pattern is the way a user or application accesses data, including operations such as reading, writing or modifying data. It helps the system understand current data traffic and access demand, which in turn helps optimize data transmission paths and resource allocation.
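As a minimal sketch of how one sample of this real-time operation data might be represented, the following Python record groups the three kinds of information named above. The class names, field names, and the 0.9 / 0.1 load cut-offs are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum


class AccessOp(Enum):
    """Data access pattern: the kind of operation performed on the data."""
    READ = "read"
    WRITE = "write"
    MODIFY = "modify"


@dataclass(frozen=True)
class NodeLoad:
    """Real-time node load information, each value a fraction in [0, 1]."""
    cpu_utilization: float
    memory_utilization: float
    bandwidth_utilization: float


@dataclass(frozen=True)
class OperationRecord:
    """One sample of real-time operation data (hypothetical layout)."""
    node_id: str
    timestamp: float      # seconds since the epoch
    load: NodeLoad
    online: bool          # node availability: online/offline status
    error_count: int      # entries found in fault reports or error logs
    access_op: AccessOp


def load_state(load: NodeLoad, high: float = 0.9, low: float = 0.1) -> str:
    """Classify a node as overloaded, underutilized, or normal.

    The cut-offs are assumptions chosen only for illustration.
    """
    peak = max(load.cpu_utilization, load.memory_utilization,
               load.bandwidth_utilization)
    if peak >= high:
        return "overloaded"
    if peak <= low:
        return "underutilized"
    return "normal"
```

A record such as `OperationRecord("node-a", 0.0, NodeLoad(0.95, 0.4, 0.2), True, 0, AccessOp.READ)` would be classified as overloaded under these assumed cut-offs, because its highest utilization is at least 0.9.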
The collection and analysis of these real-time operation data is critical to the dynamic fault repair model. The data provide the anomaly locating layer and the anomaly identifying layer with the information they need to locate and identify possibly abnormal nodes accurately, and guide the dynamic repair layer in taking appropriate and timely measures to repair those nodes, thereby ensuring the stability and availability of the system.
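The three layers named above can be read as a locate, identify, repair sequence. The sketch below is one hypothetical arrangement of that sequence in Python; the record fields (`node_id`, `online`, `cpu`), the fault names, and the repair actions are all assumptions made for illustration, since the text does not specify them.

```python
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, object]

# Hypothetical mapping from an identified fault kind to a repair action.
REPAIR_ACTIONS = {
    "node_offline": "migrate_tasks_and_restart_node",
    "overload": "rebalance_load",
    "unknown": "escalate_to_operator",
}


def locate(abnormal_records: List[Record]) -> Dict[str, List[Record]]:
    """Locating layer: group abnormal records by the node that produced them."""
    by_node = defaultdict(list)
    for record in abnormal_records:
        by_node[record["node_id"]].append(record)
    return dict(by_node)


def identify(records: List[Record]) -> str:
    """Identifying layer: name the fault from availability and load fields."""
    if any(not r["online"] for r in records):
        return "node_offline"
    if any(r["cpu"] >= 0.9 for r in records):  # assumed overload cut-off
        return "overload"
    return "unknown"


def plan_repairs(abnormal_records: List[Record]) -> Dict[str, str]:
    """Repair layer: choose one repair action per affected node."""
    return {
        node: REPAIR_ACTIONS[identify(records)]
        for node, records in locate(abnormal_records).items()
    }
```

Summarizing the abnormal records per node before choosing an action means that several abnormal samples from the same node lead to a single repair decision rather than one decision per sample.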
102, Performing stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data.
In an embodiment of the present application, the streaming segmentation process refers to the segmentation and processing of real-time operation data, which is divided into a plurality of real-time operation data streams. This may be done by dividing the data into multiple streams, each stream corresponding to a set of related data, based on the characteristics, type, or other criteria of the real-time data. The purpose of this is to better process and analyze data of different types or sources to improve the efficiency and performance of the system.
Specifically, the real-time operation data may first be partitioned according to a rule or condition, for example by data source node, data type or timestamp. This divides the data into a plurality of logically related portions. Each segmented portion is then treated as an independent data stream. These data streams may undergo a series of processing steps, such as filtering, conversion and aggregation, to meet the needs of subsequent processing. The processed data streams may then be combined or sent to different processing modules for subsequent analysis or processing. This may require multiplexing, i.e., processing multiple data streams simultaneously.
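A minimal sketch of the partitioning step, assuming each record is a dictionary with hypothetical `node_id`, `type` and `timestamp` fields: one generic function groups records by a caller-supplied key, so the same code can segment by source node, by data type, or by time window. The filtering, conversion and aggregation steps are left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List

Record = Dict[str, object]


def segment_streams(
    records: Iterable[Record],
    key: Callable[[Record], Hashable],
) -> Dict[Hashable, List[Record]]:
    """Split records into independent streams, one per distinct key value."""
    streams = defaultdict(list)
    for record in records:
        streams[key(record)].append(record)
    return dict(streams)


def by_node(record: Record) -> Hashable:
    """Segment by data source node."""
    return record["node_id"]


def by_type(record: Record) -> Hashable:
    """Segment by data type."""
    return record["type"]


def by_time_window(width_s: float) -> Callable[[Record], Hashable]:
    """Segment by fixed-width time window; returns the window index."""
    def key(record: Record) -> Hashable:
        return int(record["timestamp"] // width_s)
    return key


records = [
    {"node_id": "node-a", "type": "load", "timestamp": 0.5},
    {"node_id": "node-b", "type": "load", "timestamp": 1.2},
    {"node_id": "node-a", "type": "access", "timestamp": 61.0},
]
per_node = segment_streams(records, by_node)
per_minute = segment_streams(records, by_time_window(60.0))
```

Here `per_node` holds two streams (two records for `node-a`, one for `node-b`), and `per_minute` holds two streams keyed by window index 0 and 1.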
In the embodiment of the application, the streaming segmentation process is beneficial to dividing the real-time operation data into a plurality of streams according to different characteristics or sources, so that the system can process and analyze the data more efficiently, thereby improving the accuracy and efficiency of anomaly detection, identification and repair.
103, Inputting the plurality of real-time operation data streams in parallel into the branch modules of the streaming scalable anomaly detection model, the plurality of branch modules each performing dynamic fuzzy detection processing on the corresponding real-time operation data stream, to obtain the abnormal operation data streams among the plurality of real-time operation data streams.
In the embodiment of the application, the streaming scalable anomaly detection model is a model for anomaly detection on real-time data streams, characterized by streaming processing and scalability. The model processes data streams as they are generated, without waiting for all data to arrive. It continuously receives, processes and analyzes data streams and discovers the anomalies in them in time. This streaming capability allows the model to monitor and detect in real time while the data source keeps generating data, so anomalies are found sooner. The model can also adapt to data stream environments of different scale and complexity: its processing capacity can be expanded dynamically according to actual demand to cope with growth or change in data volume. This scalability lets the model fit different application scenarios and maintain efficient anomaly detection on large-scale data streams. Such models typically use data processing and analysis techniques, such as machine learning algorithms and data mining techniques, to identify and detect abnormal conditions. The model includes a plurality of components or sub-models that process data streams of different types or sources and combine their results during analysis to arrive at a final anomaly detection result.
In general, the streaming scalable anomaly detection model aims to identify anomalies therein by processing and analyzing data streams in real time, thereby helping users to discover and deal with potential problems in time and improving the stability and reliability of the system.
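The two properties described above, processing each value as it arrives and adding capacity by adding branches, can be illustrated with a short Python sketch. Two substitutions are made and should be read as assumptions: a thread pool stands in for branch modules deployed on separate processing nodes, and a running z-score check (Welford's online mean and variance) stands in for the dynamic fuzzy detection, which is not implemented here.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


class StreamingBranch:
    """Stand-in branch module: flags values far from the running mean."""

    def __init__(self, z_threshold: float = 3.0, warmup: int = 5):
        self.z_threshold = z_threshold
        self.warmup = max(2, warmup)  # need two values for a variance
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def observe(self, value: float) -> bool:
        """Score one value against history, then fold it into the statistics."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.z_threshold:
                is_anomaly = True
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly

    def run(self, stream: Iterable[float]) -> List[float]:
        """Consume a stream value by value and return the flagged values."""
        return [value for value in stream if self.observe(value)]


def detect_in_parallel(streams: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Run one branch per stream and keep only streams that contain anomalies."""
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        futures = {
            name: pool.submit(StreamingBranch().run, stream)
            for name, stream in streams.items()
        }
        results = {name: future.result() for name, future in futures.items()}
    return {name: hits for name, hits in results.items() if hits}


streams = {
    "node-a": [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 0.99],
    "node-b": [0.30, 0.31, 0.29, 0.30, 0.31, 0.30, 0.29],
}
abnormal = detect_in_parallel(streams)
```

In this example only the `node-a` stream is returned, because its last value lies far outside the range seen so far, while `node-b` stays within its usual variation. Each branch keeps only a count, a mean and a sum of squared deviations, so memory use does not grow with stream length.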
As an alternative embodiment, it is assumed that each branch module in the stream type extensible anomaly detection model includes at least: a fuzzy tree structure construction layer, a path planning layer, a path integration layer, an anomaly detection layer and an output layer. For example, following the description of step 103, the branch module of the stream type extensible anomaly detection model can be structured as follows:
Fuzzy tree structure construction layer: at this layer, the real-time operation data stream is cyclically segmented to build a fuzzy tree structure network. This network includes a root grid and leaf grids. Under the fuzzy tree structure design, the data stream forms a grid structure as it is repeatedly segmented and processed, where each grid represents the data within a certain range.
Path planning layer: this layer is responsible for obtaining path planning data from the root mesh to each leaf mesh in the fuzzy tree structure network. The path planning data records paths that are traversed from the root mesh to each leaf mesh, which may be data transmission paths, processing paths, or other types of paths.
Path integration layer: at this layer, the path length of each leaf grid is calculated based on the path planning data. The path length may be considered as one of the attributes of the leaf grid that reflects the distance or path that data travels in the fuzzy tree structure network.
Anomaly detection layer: this layer determines an anomaly score for each leaf grid based on the path length of that leaf grid. The anomaly score represents the degree of anomaly of the leaf grid, and its value may depend on the path length or on other anomaly-related factors. If the anomaly score exceeds the dynamic anomaly threshold, the leaf grid is marked as an abnormal grid, indicating that the corresponding portion of the data may be abnormal.
Output layer: finally, the real-time operation data stream corresponding to each abnormal grid is extracted at the output layer to form the abnormal operation data stream. The abnormal operation data stream contains the portion of the original real-time data stream identified as abnormal.
By processing a plurality of real-time operation data streams in parallel and executing the corresponding dynamic fuzzy detection processing in each branch module, this model structure can effectively identify abnormal conditions in the real-time operation data streams. Each component of the model has a specific function, and the components cooperate to detect and identify anomalies accurately. Meanwhile, the adaptive evaluation of the dynamic anomaly threshold lets the model adapt to different environments and conditions, which improves its robustness and practicality.
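The parallel dispatch of streams to branch modules can be illustrated with a minimal Python sketch. It only shows the fan-out pattern: `branch_detect` is a hypothetical stand-in that flags values far from the stream median, not the fuzzy tree detection described above, and the cut-off of 10 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def branch_detect(stream):
    """Hypothetical stand-in for one branch module.

    Flags values whose distance from the stream median exceeds 10.
    The real branch module would run the five layers described above.
    """
    ordered = sorted(stream)
    median = ordered[len(ordered) // 2]
    return [value for value in stream if abs(value - median) > 10]


def detect_in_parallel(streams):
    """Feed each stream to its own branch module and collect the results.

    Assumes `streams` is a non-empty list of non-empty numeric lists.
    Results come back in the same order as the input streams.
    """
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        return list(pool.map(branch_detect, streams))


if __name__ == "__main__":
    print(detect_in_parallel([[1, 2, 3, 50], [5, 5, 5]]))
```

In a distributed deployment each branch module would run on a separate processing node; a thread pool is used here only to keep the sketch self-contained.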
Based on the structure assumed above, step 103, in which the plurality of real-time operation data streams are input in parallel into the branch modules of the stream type extensible anomaly detection model and each branch module executes dynamic fuzzy detection processing on its corresponding real-time operation data stream to obtain the abnormal operation data streams among the plurality of real-time operation data streams, may be implemented as the following steps:
201, cyclically segmenting, through the fuzzy tree structure construction layer, the real-time operation data stream corresponding to the current branch module, to construct the fuzzy tree structure network corresponding to the current branch module; the fuzzy tree structure network comprises a root grid and leaf grids;
202, obtaining, through the path planning layer, path planning data describing the path traversed from the root grid to each leaf grid in the fuzzy tree structure network;
203, calculating the path length of each leaf grid based on the path planning data through a path integration layer;
204, determining an anomaly score of each leaf grid based on the path length of each leaf grid through an anomaly detection layer, and taking the leaf grid with the anomaly score larger than a dynamic anomaly threshold as an anomaly grid; the dynamic abnormal threshold is obtained through self-adaptive evaluation of real-time load conditions and/or node availability in the intelligent computing cloud operating system;
205, using, through the output layer, the real-time operation data stream corresponding to each abnormal grid as the abnormal operation data stream.
Further alternatively, in 201, by using a fuzzy tree structure building layer, the real-time operation data stream corresponding to the current branching module is circularly segmented, so as to build a fuzzy tree structure network corresponding to the current branching module, which may be implemented as the following steps:
301, receiving a real-time operation data stream corresponding to a current branching module;
302, randomly selecting any axis as a dividing axis in an initial cycle period, and dividing a real-time operation data stream by the selected dividing axis to obtain two first sub-data streams;
303, in the second cycle period, re-selecting another arbitrary axis as a dividing axis, and dividing the two first sub-data streams respectively by using the selected dividing axis to obtain two second sub-data streams respectively corresponding to the two first sub-data streams;
304, repeating the cycle period, and splitting each sub-data stream until reaching the splitting stop condition, stopping the cycle, so as to obtain the fuzzy tree structure network.
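Steps 301 to 304 can be sketched in Python as follows. This is a minimal illustration under several assumptions: the data stream is treated as a batch of numeric feature tuples, the dividing axis is limited to a coordinate axis, and the stop conditions are a fixed `max_height` and a fixed `min_size` per grid. The names `build_fuzzy_tree`, `leaves`, `max_height` and `min_size` are hypothetical.

```python
import random


def build_fuzzy_tree(points, max_height=8, min_size=1, seed=0):
    """Split `points` cycle by cycle, choosing one random axis per cycle.

    Assumes `points` is a non-empty list of equal-length numeric tuples.
    """
    rng = random.Random(seed)
    root = {"points": list(points), "depth": 0, "children": []}
    frontier = [root]
    n_dims = len(points[0])
    for _ in range(max_height):              # stop: maximum height reached
        axis = rng.randrange(n_dims)         # dividing axis for this cycle
        next_frontier = []
        for node in frontier:
            pts = node["points"]
            if len(pts) <= min_size:         # stop: grid is small enough
                continue
            values = [p[axis] for p in pts]
            lo, hi = min(values), max(values)
            if lo == hi:                     # no spread here; retry next cycle
                next_frontier.append(node)
                continue
            cut = rng.uniform(lo, hi)
            left = [p for p in pts if p[axis] < cut]
            right = [p for p in pts if p[axis] >= cut]
            if not left or not right:        # degenerate cut; retry next cycle
                next_frontier.append(node)
                continue
            for part in (left, right):
                child = {"points": part, "depth": node["depth"] + 1,
                         "children": []}
                node["children"].append(child)
                next_frontier.append(child)
        if not next_frontier:
            break
        frontier = next_frontier
    return root


def leaves(node):
    """Return all leaf grids below `node` (the node itself if it has no children)."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

The `depth` stored on each leaf grid is the number of splits between the root grid and that leaf, which is one possible reading of the path length used in step 203.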
In the embodiment of the application, the fuzzy tree structure network is a network model for describing ambiguity and a hierarchical structure in a data stream. It is typically used to process real-time data streams and construct a hierarchical structure based on the characteristics and relevance of the data to more efficiently process and analyze the data. In a fuzzy tree structure network, data is organized into a tree structure including a root node and a plurality of branch nodes. Each node represents a subset of the data stream or a range of data. There may be a hierarchical relationship between these nodes that represents the logical or actual relevance of the data.
"Ambiguity" of such a network structure refers to the fact that there may be some overlapping or ambiguous relationship between nodes, i.e., a certain data point may belong to multiple nodes at the same time, rather than being strictly divided among a certain node. Such ambiguity may allow the network to be more flexible to accommodate different types and features of data streams.
The fuzzy tree structure network can be used in the fields of data analysis, anomaly detection and the like, and the relationship between the data can be better understood by organizing the data according to a certain hierarchical structure, so that the efficiency and the accuracy of data processing and analysis are improved. In tasks such as anomaly detection, the fuzzy tree network can help identify the data subset of the anomaly and speed up the anomaly detection process.
Specifically, steps 201 to 205 are specific steps for processing a data stream and detecting an anomaly in a stream scalable anomaly detection model, and are briefly described as follows:
In 201, for the real-time operation data stream corresponding to each branch module, a fuzzy tree structure construction method is used to cyclically segment the data stream and organize it into a fuzzy tree structure network. The network includes a root grid and leaf grids; segmenting and organizing the data stream in this way produces a hierarchical structure for subsequent processing and analysis.
At 202, at a path planning layer, path planning data from a root mesh to each leaf mesh in a fuzzy tree structure network is obtained. These path planning data record the paths that go from the root mesh to each leaf mesh, which may be data transmission paths, processing paths, or other types of paths. These data will be used for subsequent path integration calculations.
At 203, at the path integration layer, the path length of each leaf grid is calculated based on the path planning data. The path length reflects the distance or path that data travels in the fuzzy tree structure network and is one of the indicators that evaluate the degree of association between the parts in the data stream.
In 204, the anomaly detection layer determines an anomaly score for each leaf grid based on the path length of each leaf grid. The anomaly score indicates the degree of anomaly of the leaf grid, and if the score exceeds a dynamic anomaly threshold, the leaf grid is marked as an anomaly grid. The dynamic anomaly threshold value is adaptively evaluated according to the real-time load condition and/or node availability in the intelligent computing cloud operating system.
At 205, the output layer extracts the real-time operation data stream corresponding to each abnormal grid to form the abnormal operation data stream. This abnormal operation data stream contains the portion of the original real-time data stream identified as abnormal and can be used for further analysis and processing.
The steps together form a flow for processing and anomaly detecting the real-time operation data flow in the flow type extensible anomaly detection model, and the timely identification and processing of the anomaly condition are realized by processing a plurality of data flows in parallel and combining a dynamic fuzzy detection method.
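The thresholding in step 204 can be illustrated with a short Python sketch. The text states only that the dynamic anomaly threshold is evaluated adaptively from real-time load and/or node availability; the linear rule below, its weights, and the names `dynamic_threshold` and `abnormal_grids` are assumptions made for illustration.

```python
def dynamic_threshold(base, load, availability,
                      load_weight=0.1, avail_weight=0.1):
    """Hypothetical adaptive rule, clamped to [0, 1].

    Assumption: higher load raises the threshold (fewer alarms under
    pressure) and lower availability lowers it (more sensitive detection).
    `load` and `availability` are expected in the range 0..1.
    """
    raw = base + load_weight * load - avail_weight * (1.0 - availability)
    return min(1.0, max(0.0, raw))


def abnormal_grids(scores, threshold):
    """Return the ids of leaf grids whose anomaly score exceeds the threshold."""
    return [grid_id for grid_id, score in scores.items() if score > threshold]


if __name__ == "__main__":
    threshold = dynamic_threshold(base=0.6, load=0.0, availability=1.0)
    print(abnormal_grids({"grid-a": 0.9, "grid-b": 0.4}, threshold))
```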
In the embodiment of the present application, one optional segmentation stop condition is that the number of grids in each sub-data stream is below a dynamic grid threshold. That is, during construction of the fuzzy tree structure network, a sub-data stream is considered to have reached the stop condition once the number of grids corresponding to it falls below the dynamic grid threshold. The dynamic grid threshold can be adjusted dynamically according to the characteristics and requirements of the real-time operation data streams, so that the model can effectively process data streams of different scales and densities. A grid count below the dynamic grid threshold means that the sub-data stream is already divided finely enough and needs no further segmentation. In this case the model stops segmenting that sub-data stream and moves on to the next sub-data stream or to other operations.
Another cut stop condition is that the fuzzy tree structure network reaches a preset maximum height. In the process of model construction, a maximum height is usually preset to limit the depth of the fuzzy tree structure network and avoid excessive hierarchical segmentation. When the fuzzy tree structure network reaches the preset maximum height, the maximum limit of hierarchical division is reached, and the fuzzy tree structure network is not suitable for continuous segmentation. Thus, the model in this case will stop further segmentation of the entire network, completing the model building process.
The setting of the two segmentation stopping conditions aims at ensuring that the construction process of the fuzzy tree structure network is stopped at a proper time, and invalid segmentation and excessive hierarchical structure are avoided, so that the efficiency and performance of the model are improved.
Further, in the foregoing step, the randomly selecting an arbitrary axis as the dividing axis, or the re-selecting another arbitrary axis as the dividing axis may be implemented as:
First, the real-time operation data stream corresponding to the current branch module is mapped to the data characteristic group image space to form a real-time operation data characteristic group. Then, based on the distribution of the data characteristic elements in the real-time operation data characteristic group, a segmentation position and a segmentation angle are randomly selected and recorded to form the dividing axis. Here, the arbitrary axis is any one of a linear axis, a nonlinear axis and a multi-type combined axis in the data characteristic group image space.
The description refers to a method for randomly selecting or re-selecting a segmentation axis in the construction process of a fuzzy tree structure network, which comprises the following specific steps:
Firstly, mapping the real-time operation data stream corresponding to the current branching module to a data characteristic group image space. This mapping process converts the data points in the original data stream into feature vectors or feature descriptions to form real-time operational data feature clusters. In this feature cluster image space, each data point may be represented by a feature vector that describes the attributes and features of the data point in the various feature dimensions. Based on the distribution condition of the data characteristic elements in the real-time operation data characteristic group, a random selection method is adopted to determine the segmentation position and the segmentation angle. This means that in the data feature cluster image space, a position is randomly selected as the slicing position, and an angle is randomly determined as the slicing angle. This procedure ensures a diversified segmentation in the feature space, resulting in better coverage and robustness when constructing a fuzzy tree-structured network. Once the segmentation position and segmentation angle are determined, they can be recorded, constituting a segmentation axis. This segmentation axis will be used to segment the data stream in the data feature group image space, dividing the data stream into different subsets or grids. Recording the position and angle of the segmentation axis facilitates the subsequent data segmentation and model construction process and ensures that certain randomness and diversity is maintained at each segmentation step.
In general, the method for randomly selecting or re-selecting the segmentation axis can effectively increase the diversity and robustness of the model and improve the adaptability of the fuzzy tree structure network to different types of data streams, thereby describing the ambiguity and the hierarchical structure in the data streams more accurately.
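The random choice of a segmentation position and a segmentation angle can be sketched in Python for the simplest case, a linear axis in a two-dimensional feature space. The nonlinear and combined axes mentioned above are not covered, and the names `random_dividing_axis` and `split_by_axis` are hypothetical.

```python
import math
import random


def random_dividing_axis(features, rng):
    """Pick a random position inside the bounding box and a random angle.

    Assumes `features` is a non-empty list of (x, y) tuples. The angle is
    the direction of the normal vector of the dividing line.
    """
    xs = [f[0] for f in features]
    ys = [f[1] for f in features]
    position = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
    angle = rng.uniform(0.0, math.pi)
    return {"position": position, "angle": angle}


def split_by_axis(features, axis):
    """Divide the features by which side of the recorded axis they fall on."""
    px, py = axis["position"]
    nx, ny = math.cos(axis["angle"]), math.sin(axis["angle"])
    left, right = [], []
    for f in features:
        side = (f[0] - px) * nx + (f[1] - py) * ny  # signed distance to the line
        (left if side < 0 else right).append(f)
    return left, right


if __name__ == "__main__":
    rng = random.Random(1)
    feats = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (4.0, 4.0)]
    axis = random_dividing_axis(feats, rng)
    print(axis, split_by_axis(feats, axis))
```

Recording the position and angle in a plain dictionary mirrors the requirement that the selected values be stored so that the same dividing axis can be reapplied later.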
In the embodiment of the present application, in the anomaly detection layer, n is the total number of grids in the fuzzy tree structure network, and the anomaly score S (x, n) of the xth leaf grid is expressed as the following formula:
wherein H(x) represents the average path length of the x-th leaf grid over all independent tree grid structures in the fuzzy tree structure network, H(n-1) represents the expected value of the average path length of all leaf grids when the total number of grids is n-1, and the remaining coefficient in the formula is an interpretable factor for adjusting the anomaly score.
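As a hedged illustration only, the sketch below assumes an isolation-forest-style form of the score, S = 2^(-alpha * H(x) / H(n-1)), in which a shorter average path length gives a higher score. This form and the placement of the adjusting factor (here called `alpha`, a hypothetical name) are assumptions and are not taken from the formula of this embodiment.

```python
def anomaly_score(avg_path_length, expected_path_length, alpha=1.0):
    """Assumed form: 2 ** (-alpha * H(x) / H(n-1)).

    avg_path_length      -- H(x), average path length of the leaf grid
    expected_path_length -- H(n-1), expected average path length (must be > 0)
    alpha                -- hypothetical adjusting factor
    """
    if expected_path_length <= 0:
        raise ValueError("expected_path_length must be positive")
    return 2.0 ** (-alpha * avg_path_length / expected_path_length)


if __name__ == "__main__":
    # A leaf reached by a short path scores closer to 1 than a typical leaf.
    print(anomaly_score(1.0, 5.0), anomaly_score(5.0, 5.0))
```

Under this assumed form, a leaf grid whose average path length equals the expected value scores 0.5, and scores approach 1 as the path length approaches zero.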
104, Summarizing the abnormal operation data flow through a dynamic fault repair model, and dynamically repairing the intelligent computing cloud operating system.
As an alternative embodiment, it is assumed that the dynamic fault repair model includes at least: an abnormality locating layer, an abnormality identifying layer and a dynamic repairing layer. The following are descriptions of their respective actions: the abnormal positioning layer is responsible for receiving the abnormal operation data stream in real time and associating the abnormal operation data stream with the corresponding node according to the data source node so as to accurately position the node which may have problems. By collecting real-time abnormal operation data, the layer can rapidly identify the nodes which are likely to have faults, and provide accurate positioning for subsequent fault diagnosis and repair. The anomaly identification layer performs anomaly judgment on the candidate nodes based on historical operation data of the candidate abnormal nodes, the network topology structure and the real-time processing data stream, and determines whether the candidate nodes are in an abnormal state. By analyzing the historical behavior of the node, the communication condition with the adjacent node and the real-time data processing condition, the layer can distinguish which nodes are actually in an abnormal state, so that false alarms and false judgments are avoided. And the dynamic repair layer is responsible for repairing the abnormal node in real time according to the abnormal recognition result when the system fails, so that the abnormal node is recovered to a healthy running state. Once the abnormal node is determined, the layer will initiate a corresponding repair mechanism, possibly including operations of reconnecting the node, adjusting the network topology, increasing the bandwidth, etc., to resume the normal operation of the system as soon as possible.
The cooperation of the three layers enables the system to realize timely monitoring, accurate diagnosis and effective repair of abnormal nodes, so that stability and usability of the system are improved, and the system is ensured to quickly recover and keep normal operation under various abnormal conditions.
Based on the above structure, in 104, by means of the dynamic fault repair model, summarizing the abnormal operation data stream, and dynamically repairing the intelligent computing cloud operating system, it may be implemented as follows:
401, receiving the abnormal operation data stream in real time through an abnormal positioning layer, and taking a data source node corresponding to the abnormal operation data stream as a candidate abnormal node;
402, judging whether the candidate abnormal node is an abnormal node or not based on the historical operation data of the candidate abnormal node, the network topological structure and the real-time processing data flow through an abnormal recognition layer;
403, starting repair processing on the abnormal node in the intelligent computing cloud operating system through the dynamic repair layer so as to recover the abnormal node to a healthy running state.
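Steps 401 to 403 can be sketched as three small Python functions, one per layer. The record fields, the response-time rule used for identification, and the single `"reconnect"` repair action are assumptions chosen to keep the example short; the names `NodeInfo`, `locate`, `identify` and `repair` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeInfo:
    node_id: str
    recent_response_ms: list = field(default_factory=list)  # historical data
    abnormal_neighbors: int = 0   # derived from the network topology (assumed)


def locate(abnormal_stream):
    """Anomaly locating layer: source nodes of abnormal records are candidates."""
    return sorted({record["source_node"] for record in abnormal_stream})


def identify(candidates, nodes, max_avg_ms=500.0, min_abnormal_neighbors=1):
    """Anomaly identification layer (hypothetical rule).

    A candidate is confirmed when its average historical response time is
    too high and at least one neighbor also reports communication problems.
    """
    confirmed = []
    for node_id in candidates:
        info = nodes[node_id]
        if not info.recent_response_ms:
            continue
        average = sum(info.recent_response_ms) / len(info.recent_response_ms)
        if average > max_avg_ms and info.abnormal_neighbors >= min_abnormal_neighbors:
            confirmed.append(node_id)
    return confirmed


def repair(confirmed):
    """Dynamic repair layer: return the planned action for each abnormal node."""
    return [("reconnect", node_id) for node_id in confirmed]


if __name__ == "__main__":
    nodes = {
        "n1": NodeInfo("n1", [900.0, 1100.0], abnormal_neighbors=2),
        "n2": NodeInfo("n2", [40.0, 60.0], abnormal_neighbors=0),
    }
    stream = [{"source_node": "n1"}, {"source_node": "n2"}, {"source_node": "n1"}]
    print(repair(identify(locate(stream), nodes)))
```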
Illustratively, in an intelligent computing cloud operating system, it is assumed that there is a set of nodes for processing tasks submitted by users. The nodes are interconnected by a network to form a distributed system. It is assumed that one of the nodes suddenly fails, resulting in an inability to process tasks normally. To achieve dynamic fault remediation, the following steps may be performed:
In 401, the anomaly locating layer receives the abnormal operation data stream in real time, and associates the abnormal operation data stream with the corresponding node according to the data source node. When an abnormality occurs in a node, the abnormality locating layer identifies the node as a candidate abnormal node.
In 402, the anomaly identification layer performs anomaly determination based on historical operation data of the candidate anomaly node, the network topology where the candidate anomaly node is located, and the real-time processing data stream. And determining whether the node is actually in an abnormal state or not by analyzing the historical behavior of the node, the communication condition of the node and the adjacent node and the real-time data processing condition.
In 403, once the anomaly identification layer confirms that a node is an anomaly node, the dynamic repair layer initiates repair processing for the node. The repair process may include reconnecting the nodes, adjusting the network topology, increasing bandwidth, etc., to restore the abnormal nodes to a healthy operational state.
For example, assume that a task node becomes abnormal while processing a task submitted by a user and cannot respond to requests normally. The anomaly locating layer receives the abnormal operation data stream of the node in real time and identifies the node as a candidate abnormal node. The anomaly identification layer analyzes the historical operation data of the node and finds that its response times have frequently been too long recently and that its communication with adjacent nodes is abnormal. Based on these findings, the anomaly identification layer judges the node to be an abnormal node. The dynamic repair layer then starts repair processing for the node according to the identification result, for example by reconnecting the node or adjusting the network topology, so that the node is finally restored to a healthy running state.
Through the dynamic fault repair model, the system can realize real-time monitoring and processing of abnormal nodes, improves the stability and usability of the system, and ensures that tasks submitted by users can be processed timely and effectively.
In the step 403, the repairing process of the abnormal node is started in the intelligent computing cloud operating system through the dynamic repairing layer, so that the abnormal node is restored to the healthy running state, which may be implemented as the following process:
First, the anomaly type to which the abnormal node belongs is determined. For example, rule-based classification, machine learning classification, or statistical methods may be used to classify the anomaly type. Then, a different processing strategy is executed for each anomaly type:
Strategy I: if the abnormal node is a redundant node, any node whose anomaly frequency is higher than the set upper limit of availability is removed or isolated, based on a preset dynamic redundancy repair strategy.
Specifically, an upper limit of availability is set based on the dynamic load and the redundancy of the system. If the anomaly frequency of an abnormal node exceeds this upper limit, the node is treated as no longer usable. The upper limit can be adjusted dynamically according to system requirements and actual conditions. A redundant node whose anomaly frequency is above the upper limit of availability may be removed from the system or isolated. Isolation may include temporarily disabling the node or setting it to a standby state to prevent it from continuing to participate in data processing.
In an alternative example, setting the upper limit of availability requires considering the overall design of the system, its performance metrics, and the acceptable level of risk. An acceptable upper limit of anomaly frequency is determined from the requirements and objectives of the system design, taking into account the service level agreements (SLAs) of the system and the availability level expected by users. The performance metrics of the system, such as response time and throughput, and the effect of abnormal nodes on these metrics are also considered: if the anomaly frequency of a node exceeds a certain threshold, the impact on system performance may be unacceptable. Historical data is analyzed to understand how often abnormal nodes occur and how severe their impact is, so that a reasonable upper limit of anomaly frequency can be set under which the system operates normally in most conditions. The risks and consequences of exceeding the upper limit are assessed, including reduced system performance, data loss and a degraded user experience, and the acceptable upper limit is determined from this risk assessment. A real-time monitoring system is established to monitor the frequency of abnormal nodes in the system and to adjust the upper limit of availability in time, so that the limit adapts to changes in system operation. By weighing these factors together, a reasonable upper limit of availability can be determined so that the system remains stable under both normal and abnormal conditions.
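Strategy I can be illustrated with a minimal Python sketch, assuming the anomaly frequency is measured as anomalies per minute over an observation window. The function name, its parameters and the per-minute unit are hypothetical.

```python
def nodes_to_isolate(anomaly_counts, window_seconds, redundant_nodes,
                     upper_limit_per_minute):
    """Return redundant nodes whose anomaly frequency exceeds the upper limit.

    anomaly_counts         -- mapping of node id to anomalies seen in the window
    window_seconds         -- length of the observation window (must be > 0)
    redundant_nodes        -- ids of nodes that are redundant and may be removed
    upper_limit_per_minute -- current upper limit of availability (assumed unit)
    """
    minutes = window_seconds / 60.0
    selected = []
    for node_id, count in anomaly_counts.items():
        frequency = count / minutes
        if node_id in redundant_nodes and frequency > upper_limit_per_minute:
            selected.append(node_id)
    return sorted(selected)


if __name__ == "__main__":
    counts = {"n1": 30, "n2": 2, "n3": 40}
    # Over a 10-minute window with a limit of 1 anomaly per minute,
    # n1 (3.0/min) is isolated; n3 (4.0/min) is not redundant, so it is kept.
    print(nodes_to_isolate(counts, 600, {"n1", "n2"}, 1.0))
```

Because the upper limit is passed in as an argument, a monitoring component can recompute it on each call, which matches the requirement that the limit be adjustable at run time.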
Strategy II: if the data in the abnormal node is abnormal, the number of the data copies of the abnormal node is increased, and the data copies are stored on the expanded backup nodes.
In the embodiment of the application, a backup node is a node that is itself in a healthy state and whose data access mode is also in a healthy state. The backup node must be in a healthy state, meaning that its hardware is intact, its network connections are normal, and its system is running stably. Only healthy nodes can reliably take on data backup and processing tasks. Nodes in a healthy state generally have stable performance and reliable data storage capacity, so they can cope with abnormal situations and ensure the safety and availability of data. The data access mode of the backup node must also be in a healthy state, meaning that the node responds normally to data read and write requests and that no loss or error occurs during data transmission. A healthy data access mode means that the backup node can synchronize and interact with the main node as expected, which ensures the consistency and integrity of the data.
For the nodes with data anomalies, the number of the data copies can be increased so as to improve the redundancy and reliability of the data. This can be achieved by increasing the number of copies of the data replication. And, there is also a need to ensure that copies of these data are stored on healthy backup nodes to prevent further propagation of anomalous data and compromise of system stability.
Further, when selecting backup nodes, candidate nodes must be strictly screened to ensure that they meet both requirements. Only a node that is in a healthy state and has a healthy data access mode is suitable as a backup node for storing the data copies of the abnormal node and taking over its data processing tasks. Selecting backup nodes in this way improves the reliability and fault tolerance of the system, allows data to be recovered in time under abnormal conditions, and ensures normal operation of the system.
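The two-part screening of backup nodes can be sketched in Python as a filter followed by a selection. The record fields `node_healthy`, `access_healthy` and `free_gb`, and the choice to prefer nodes with more free storage, are assumptions for illustration.

```python
def select_backup_nodes(candidates, copies_needed):
    """Pick backup nodes that pass both health checks.

    Each candidate is a dict with the hypothetical fields:
      id, node_healthy (bool), access_healthy (bool), free_gb (number).
    Eligible nodes with more free storage are preferred (an assumption).
    """
    eligible = [c for c in candidates
                if c["node_healthy"] and c["access_healthy"]]
    eligible.sort(key=lambda c: c["free_gb"], reverse=True)
    return [c["id"] for c in eligible[:copies_needed]]


if __name__ == "__main__":
    candidates = [
        {"id": "b1", "node_healthy": True, "access_healthy": True, "free_gb": 200},
        {"id": "b2", "node_healthy": True, "access_healthy": False, "free_gb": 900},
        {"id": "b3", "node_healthy": True, "access_healthy": True, "free_gb": 500},
        {"id": "b4", "node_healthy": False, "access_healthy": True, "free_gb": 800},
    ]
    print(select_backup_nodes(candidates, copies_needed=2))
```

If fewer eligible nodes exist than copies are needed, the function returns a shorter list, leaving the caller to decide whether to expand the backup pool.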
Strategy III: if a plurality of adjacent abnormal nodes exist, the network topology structure of the local network where the abnormal nodes are located is optimized.
In this way, the efficiency of data transmission and the stability of the system can be improved by adjusting the network topology structure of the adjacent abnormal nodes. This may involve operations such as reconfiguring network connections or increasing network bandwidth.
Illustratively, the network topology where the adjacent abnormal nodes are located is adjusted to improve data transmission efficiency and system stability. The following is a specific description of how this is done:
Firstly, the current network topology structure needs to be analyzed to know the connection mode, bandwidth condition and data transmission path between the nodes. This may be accomplished through a network topology map or a network monitoring tool. Further, it is determined which nodes are defined as abnormal nodes, and where they are located and adjacent nodes. These abnormal nodes may be due to hardware failures, network problems, or other reasons.
For nodes directly connected to the abnormal node, it may be considered to reconnect them to change the path of data transmission or increase the transmission bandwidth. This may require adjusting the configuration of the network device or changing the network connection.
If insufficient data transmission bandwidth between adjacent abnormal nodes results in low data transmission efficiency, an increase in bandwidth may be considered. This may be achieved by upgrading the network device, increasing the number of network links, or adjusting the network transport protocol. Or may adjust the routing policy to enable more efficient transmission of data to neighboring nodes. This may involve using a more optimal routing algorithm, adjusting packet forwarding rules, or optimizing network flow control policies.
For nodes requiring frequent communication, it may be considered to increase communication channels between them or to use dedicated communication lines to reduce delay and packet loss rate of data transmission.
After the adjustment, a real-time monitoring system needs to be established to monitor the change of the network topology structure and the improvement of the data transmission efficiency. And according to the monitoring result, timely adjusting network configuration and strategy to ensure the stability and performance of the system. Through the operation, the network topology structure of the adjacent abnormal nodes can be optimized, the data transmission efficiency and the system stability are improved, and therefore the operation requirements and abnormal conditions of the system are well adapted.
Strategy IV: if a plurality of adjacent abnormal nodes exist, a collective isolation operation is carried out on the plurality of adjacent abnormal nodes, and a backup node cluster is established for taking over the data processing tasks of the abnormal nodes which are isolated in a collective way.
In particular, if there are multiple adjacent exception nodes, a collective isolation operation may be performed to isolate the exception nodes from the host system to avoid their impact on the system from spreading. This may be accomplished by adjusting the routing policies of the system or the network partition.
Meanwhile, a backup node cluster needs to be established to take over the data processing tasks of the isolated abnormal nodes. This backup node cluster should have sufficient computing and storage resources to ensure continued operation of the system and data processing capabilities.
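Finding the groups of adjacent abnormal nodes that qualify for collective isolation can be sketched as a connected-component search over the network topology. The adjacency-list representation and the name `adjacent_abnormal_groups` are assumptions; isolated single abnormal nodes are left to the other strategies.

```python
def adjacent_abnormal_groups(topology, abnormal_nodes):
    """Group abnormal nodes that are directly or transitively adjacent.

    topology       -- mapping of node id to a list of neighbor ids
    abnormal_nodes -- iterable of node ids confirmed as abnormal
    Returns a list of sorted groups, each with at least two nodes.
    """
    abnormal = set(abnormal_nodes)
    seen = set()
    groups = []
    for start in sorted(abnormal):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:                      # depth-first search over abnormal nodes
            node = stack.pop()
            group.append(node)
            for neighbor in topology.get(node, ()):
                if neighbor in abnormal and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(group) > 1:                # only multi-node groups are collective
            groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    topology = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}
    print(adjacent_abnormal_groups(topology, ["a", "b", "d"]))
```

Each returned group can then be isolated as a unit and handed to a backup node cluster sized for the combined workload of that group.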
Through the processing strategy, different types of abnormal nodes can be effectively identified and processed, so that the stability and reliability of the system are improved.
In the embodiment of the application, the running data is acquired and processed in real time, so that the abnormality and the fault occurring in the system can be responded and processed quickly, and the downtime of the system is reduced. The distributed architecture of the flow type extensible anomaly detection model is utilized for anomaly detection, so that system resources can be fully utilized, the processing efficiency is improved, and the response time is shortened. The dynamic fault repair can take the most proper repair measures according to the current state of the system and the nature of the fault, and the repair capability and flexibility of the system are improved. In summary, real-time operation data is processed in parallel through a streaming type extensible anomaly detection model, so that real-time anomaly detection and dynamic fault repair are realized, and the self-adaptive capacity, stability and reliability of the system are enhanced.
In yet another embodiment of the present application, there is also provided an intelligent computing cloud operating system for running and managing intelligent platform resources; as described with reference to fig. 3, the intelligent computing cloud operating system includes the following elements:
the acquisition unit is configured to acquire real-time operation data of the intelligent computing cloud operating system; the real-time operation data at least comprises: real-time node load information, node availability information and data access modes;
The stream segmentation unit is configured to perform stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data;
The detection unit is configured to input the plurality of real-time operation data streams in parallel into the branch modules of the streaming extensible anomaly detection model, the plurality of branch modules respectively executing dynamic fuzzy detection processing on their corresponding real-time operation data streams to obtain the abnormal operation data streams among the plurality of real-time operation data streams; wherein the plurality of branch modules are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system;
and the repair unit is configured to summarize the abnormal operation data flow through a dynamic fault repair model and dynamically repair the intelligent computing cloud operating system.
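The four units above can be read as a small pipeline. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the record shape `(node_id, load)`, the round-robin segmentation, the thread pool standing in for distributed processing nodes, and the placeholder detector `load_branch` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical record shape: (node_id, load in [0, 1]).
Record = Tuple[str, float]
Stream = List[Record]


def split_into_streams(records: List[Record], num_streams: int) -> List[Stream]:
    """Stream segmentation: deal records round-robin into num_streams streams."""
    streams: List[Stream] = [[] for _ in range(num_streams)]
    for index, record in enumerate(records):
        streams[index % num_streams].append(record)
    return streams


def detect_in_parallel(streams: List[Stream],
                       branch: Callable[[Stream], Stream]) -> Stream:
    """Run one branch detector per stream and merge the abnormal records."""
    # Threads stand in for branch modules deployed on separate nodes.
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        per_stream = list(pool.map(branch, streams))
    return [record for abnormal in per_stream for record in abnormal]


def load_branch(stream: Stream) -> Stream:
    """Placeholder branch module: flags records whose load exceeds 0.9."""
    return [record for record in stream if record[1] > 0.9]


def summarize_sources(abnormal: Stream) -> List[str]:
    """Repair-side summary: the distinct source nodes of abnormal records."""
    return sorted({node_id for node_id, _ in abnormal})
```

In a real deployment each branch would run on its own processing node; the thread pool only illustrates that the streams are handled independently and that their results are merged before repair.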
Further optionally, each branch module in the streaming extensible anomaly detection model includes at least: a fuzzy tree structure construction layer, a path planning layer, a path integration layer, an anomaly detection layer and an output layer;
When inputting the plurality of real-time operation data streams in parallel into the branch modules of the streaming extensible anomaly detection model, with the plurality of branch modules respectively executing dynamic fuzzy detection processing on their corresponding real-time operation data streams to obtain the abnormal operation data streams among the plurality of real-time operation data streams, the detection unit is configured to:
circularly dividing the real-time operation data stream corresponding to the current branch module through the fuzzy tree structure construction layer to construct a fuzzy tree structure network corresponding to the current branch module; the fuzzy tree structure network comprises a root grid and leaf grids;
acquiring, through the path planning layer, the path planning data of the path traversed from the root grid to each leaf grid in the fuzzy tree structure network;
calculating a path length of each leaf grid based on the path planning data by a path integration layer;
Determining an anomaly score of each leaf grid based on the path length of each leaf grid through an anomaly detection layer, and taking the leaf grid with the anomaly score larger than a dynamic anomaly threshold as an anomaly grid; the dynamic abnormal threshold is obtained through self-adaptive evaluation of real-time load conditions and/or node availability in the intelligent computing cloud operating system;
And using the real-time operation data stream corresponding to the abnormal grid as the abnormal operation data stream through an output layer.
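The text states that the dynamic anomaly threshold adapts to load and/or node availability but does not give the rule here. The sketch below is one hypothetical linear rule, shown only to illustrate how an adaptive threshold feeds the selection of abnormal grids; `base`, `load_gain` and `avail_gain` are assumed parameters, not values from the application.

```python
from typing import Dict, List


def dynamic_threshold(base: float, mean_load: float, availability: float,
                      load_gain: float = 0.1, avail_gain: float = 0.1) -> float:
    """Hypothetical adaptation rule, clamped to [0, 1].

    Higher load raises the threshold (fewer alarms while busy); lower
    availability lowers it (more sensitive while nodes are dropping out).
    """
    value = base + load_gain * mean_load - avail_gain * (1.0 - availability)
    return min(1.0, max(0.0, value))


def abnormal_grids(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids of leaf grids whose score is strictly above threshold."""
    return [grid_id for grid_id, score in scores.items() if score > threshold]
```

With `base=0.6`, a mean load of 0.5 and full availability, this rule yields a threshold of about 0.65, so a leaf grid scoring 0.7 is reported and one scoring 0.5 is not.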
Further optionally, when circularly dividing the real-time operation data stream corresponding to the current branch module through the fuzzy tree structure construction layer to construct the fuzzy tree structure network corresponding to the current branch module, the detection unit is configured to:
receiving a real-time operation data stream corresponding to a current branch module;
randomly selecting any axis as a dividing axis in an initial cycle period, and dividing the real-time operation data stream by the selected dividing axis to obtain two first sub-data streams;
In the second cycle, re-selecting another arbitrary axis as a dividing axis, and dividing the two first sub-data streams respectively by the selected dividing axis to obtain two second sub-data streams respectively corresponding to the two first sub-data streams;
repeating the cycle period, and dividing each sub-data stream until the division stopping condition is reached, so as to obtain the fuzzy tree structure network.
Further optionally, the division stopping condition is: the number of grids in each sub-data stream is below a dynamic grid threshold; or the fuzzy tree structure network reaches a preset maximum height.
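The cyclic division and its two stopping conditions resemble the construction of an isolation tree. The following is a minimal sketch under assumptions: data points are numeric tuples, each cycle uses one axis-parallel dividing axis taken from a pre-drawn list `axes` (so the list length acts as the maximum height), and `min_points` is a fixed stand-in for the dynamic grid threshold. The names `Grid`, `build_tree` and `leaf_grids` are hypothetical.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


class Grid:
    """A node of the tree; a grid without children is a leaf grid."""

    def __init__(self, points: List[Point], depth: int) -> None:
        self.points = points
        self.depth = depth
        self.left: Optional["Grid"] = None
        self.right: Optional["Grid"] = None


def build_tree(points: List[Point], axes: List[int], rng: random.Random,
               min_points: int = 1, depth: int = 0) -> Grid:
    """Recursively divide points; axes[d] is the dividing axis of cycle d."""
    grid = Grid(points, depth)
    # Stopping conditions: few enough points, or maximum height reached.
    if len(points) <= min_points or depth >= len(axes):
        return grid
    axis = axes[depth]
    low = min(p[axis] for p in points)
    high = max(p[axis] for p in points)
    cut = rng.uniform(low, high)  # random division position on this axis
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]
    if not left or not right:  # degenerate cut: keep this grid as a leaf
        return grid
    grid.left = build_tree(left, axes, rng, min_points, depth + 1)
    grid.right = build_tree(right, axes, rng, min_points, depth + 1)
    return grid


def leaf_grids(grid: Grid) -> List[Grid]:
    """Collect the leaf grids; each leaf's depth is its path length."""
    if grid.left is None or grid.right is None:
        return [grid]
    return leaf_grids(grid.left) + leaf_grids(grid.right)
```

A caller would draw the axes once, for example `axes = [rng.randrange(dim) for _ in range(max_height)]`, and then read the path length of each leaf grid from `leaf.depth`. Points that sit far from the rest tend to be isolated after few divisions and therefore end up in shallow leaves.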
Further optionally, when randomly selecting an arbitrary axis as the dividing axis or re-selecting another arbitrary axis as the dividing axis, the detection unit is configured to:
mapping the real-time operation data stream corresponding to the current branch module to a data characteristic group image space to form a real-time operation data characteristic group;
randomly selecting a segmentation position and a segmentation angle based on the distribution condition of data characteristic elements in the real-time operation data characteristic group, and recording the selected segmentation position and the segmentation angle to form a segmentation axis;
wherein the arbitrary axis is any one of a linear axis, a nonlinear axis and a multi-type combined axis in the data characteristic group image space.
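For the linear case, a dividing axis defined by a position and an angle can be represented as a hyperplane: a point it passes through plus a unit normal vector. The sketch below assumes this representation, which is borrowed from extended isolation forests rather than taken from the application; nonlinear and combined axes are not covered, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def random_split_axis(points: List[Point],
                      rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw a random linear dividing axis as (unit normal, position)."""
    dim = len(points[0])
    # Division angle: a random direction, normalised to unit length.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(c * c for c in direction)) or 1.0
    normal = [c / length for c in direction]
    # Division position: a random point inside the bounding box of the data.
    position = [
        rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
        for d in range(dim)
    ]
    return normal, position


def is_left(point: Point, normal: Sequence[float],
            position: Sequence[float]) -> bool:
    """True if the point lies on the negative side of the dividing axis."""
    return sum(n * (x - q) for n, x, q in zip(normal, point, position)) < 0.0
```

Recording `normal` and `position` for each cycle is enough to replay the same division later on new data from the same stream.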
Further optionally, in the anomaly detection layer, n is the total number of grids in the fuzzy tree structure network, and the anomaly score S(x, n) of the x-th leaf grid is expressed as the following formula:
wherein H(x) represents the average path length of the x-th leaf grid over all independent tree grid structures in the fuzzy tree structure network, H(n-1) represents the expected value of the average path length of all leaf grids when the total number of grids is n-1, and the formula further includes an interpretable factor for adjusting the anomaly score.
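The formula itself is not reproduced in this text. For orientation only, the sketch below implements the score of the classic isolation forest, on which the description appears to be modelled: s = 2^(-h/c(n)), with c(n) = 2H(n-1) - 2(n-1)/n and H(i) approximated by ln(i) plus the Euler-Mascheroni constant. This is an assumption, not the application's formula; in particular, the interpretable adjusting factor is omitted because its form is not given here.

```python
import math

EULER_MASCHERONI = 0.5772156649


def expected_path_length(n: int) -> float:
    """c(n) of the classic isolation forest for a sample of n points."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_MASCHERONI  # approximates H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(average_path_length: float, n: int) -> float:
    """Classic score in (0, 1]; shorter paths give scores closer to 1."""
    return 2.0 ** (-average_path_length / expected_path_length(n))
```

Under this normalisation a leaf whose average path length equals the expected length scores exactly 0.5, and scores approach 1 as the path length approaches 0, which is why leaf grids scoring above the threshold are treated as abnormal.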
Further optionally, the dynamic fault repair model includes at least: an anomaly locating layer, an anomaly identifying layer and a dynamic repairing layer;
when summarizing the abnormal operation data stream through the dynamic fault repair model and dynamically repairing the intelligent computing cloud operating system, the repair unit is configured to:
Receiving the abnormal operation data stream in real time through the anomaly locating layer, and taking the data source node corresponding to the abnormal operation data stream as a candidate abnormal node;
Judging, through the anomaly identifying layer, whether the candidate abnormal node is an abnormal node based on the historical operation data of the candidate abnormal node, the network topology and the real-time processing data flow;
and starting the repair processing of the abnormal node in the intelligent computing cloud operating system through the dynamic repair layer so as to recover the abnormal node to a healthy running state.
Further optionally, when starting, through the dynamic repair layer, the repair processing of the abnormal node in the intelligent computing cloud operating system to restore the abnormal node to a healthy running state, the repair unit is configured to:
Judging the abnormal type of the abnormal node;
If the abnormal node is a redundant node, removing or isolating nodes whose anomaly frequency is higher than the set availability upper limit, based on a preset dynamic redundancy repair strategy; or
If the data in the abnormal node is abnormal, increasing the number of data copies of the abnormal node, and storing the data copies on the expanded backup nodes; a backup node is a node whose state is healthy and whose data access mode is healthy; or
If a plurality of adjacent abnormal nodes exist, the network topology structure of the local network where the abnormal nodes are located is optimized, or a collective isolation operation is carried out on the plurality of adjacent abnormal nodes, and a backup node cluster is established for taking over the data processing task of the abnormal nodes which are isolated in a collective manner.
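The three repair branches above amount to a dispatch on the anomaly type. The following is a minimal sketch of that dispatch; the `Node` fields, the `max_anomalies` limit and the action strings are hypothetical, and the optional topology optimization of the third branch is not modelled.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class AnomalyType(Enum):
    REDUNDANT_NODE = auto()   # the abnormal node is a redundant node
    DATA_ANOMALY = auto()     # the data held by the node is abnormal
    ADJACENT_GROUP = auto()   # several adjacent nodes are abnormal


@dataclass
class Node:
    node_id: str
    anomaly_count: int = 0
    replicas: int = 1
    isolated: bool = False


def repair(nodes: List[Node], kind: AnomalyType,
           max_anomalies: int = 3) -> List[str]:
    """Apply the repair branch for this anomaly type; return actions taken."""
    actions: List[str] = []
    if kind is AnomalyType.REDUNDANT_NODE:
        # Isolate only the nodes whose anomaly count exceeds the limit.
        for node in nodes:
            if node.anomaly_count > max_anomalies:
                node.isolated = True
                actions.append(f"isolate {node.node_id}")
    elif kind is AnomalyType.DATA_ANOMALY:
        # Add one data copy, to be stored on a healthy backup node.
        for node in nodes:
            node.replicas += 1
            actions.append(f"add replica for {node.node_id}")
    elif kind is AnomalyType.ADJACENT_GROUP:
        # Collective isolation, then a backup cluster takes over the tasks.
        for node in nodes:
            node.isolated = True
        actions.append("isolate group " + ",".join(n.node_id for n in nodes))
        actions.append("start backup cluster")
    return actions
```

Returning the list of actions keeps the decision separate from its execution, so the same dispatch could drive a dry run or an audit log.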
In the embodiment of the application, the real-time operation data is processed in parallel by using the streaming type extensible anomaly detection model, so that the real-time anomaly detection and dynamic fault repair are realized, and the self-adaptive capacity, stability and reliability of the system are enhanced.
In yet another embodiment of the present application, there is also provided an intelligent computing platform, including: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
A memory for storing a computer program;
And the processor is used for realizing the distributed fault repairing method according to the embodiment of the method when executing the program stored in the memory.
The communication bus 1140 of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like.
Illustratively, assume that a large-scale, autonomously controllable intelligent computing platform based on dedicated neural network chips needs to be built, providing the hardware basis for development and construction. The intelligent computing platform can also provide the hardware foundation for building an intelligent supercomputing center, which can in turn serve as an artificial intelligence platform for scientific research, industry and urban services, gathering talent and developing industry.
Specifically, the intelligent computing platform mainly comprises: an intelligent hardware platform, an intelligent computing cloud operating system, application environment development, a big data platform and an intelligent application PaaS platform. In the intelligent hardware platform, based on intelligent computing theory, deep learning chips, AI accelerator cards and distributed servers can be integrated, so that basic hardware support is provided for the whole supercomputing platform and related derivative platforms. The intelligent hardware platform mainly consists of the following four parts: an intelligent computing subsystem, a data storage subsystem, the intelligent computing cloud operating system and a support management subsystem.
The embodiment of the application provides a distributed fault restoration method for constructing a low-energy-consumption arithmetic unit.
For ease of illustration, the bus is represented by only one thick line in fig. 3, but this does not mean that there is only one bus or only one type of bus.
The communication interface 1120 is used for communication between the above electronic device and other devices.
Memory 1130 may include random access memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor 1110 may be a general-purpose processor, including an artificial intelligence processor, a graphics processing unit (GPU), an artificial intelligence processor card (Machine Learning Unit, MLU), a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, where the computer program is executed to implement the steps executable by the electronic device in the above method embodiments.

Claims (6)

1. The distributed fault repairing method is characterized by being applied to an intelligent computing cloud operating system, wherein the intelligent computing cloud operating system is used for running and managing intelligent platform resources; the distributed fault repair method comprises the following steps:
Acquiring real-time operation data of the intelligent computing cloud operating system; the real-time operation data at least comprises: real-time node load information, node availability information and data access modes;
Performing stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data;
inputting the plurality of real-time operation data streams in parallel into the branch modules of a streaming extensible anomaly detection model, the plurality of branch modules respectively executing dynamic fuzzy detection processing on their corresponding real-time operation data streams to obtain the abnormal operation data streams among the plurality of real-time operation data streams; wherein the plurality of branch modules are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system;
Summarizing the abnormal operation data flow through a dynamic fault repair model, and dynamically repairing the intelligent computing cloud operating system;
Each branch module in the streaming extensible anomaly detection model includes at least: a fuzzy tree structure construction layer, a path planning layer, a path integration layer, an anomaly detection layer and an output layer;
The inputting of the plurality of real-time operation data streams in parallel into the branch modules of the streaming extensible anomaly detection model, with the plurality of branch modules respectively executing dynamic fuzzy detection processing on their corresponding real-time operation data streams to obtain the abnormal operation data streams among the plurality of real-time operation data streams, includes:
circularly dividing the real-time operation data stream corresponding to the current branch module through the fuzzy tree structure construction layer to construct a fuzzy tree structure network corresponding to the current branch module; the fuzzy tree structure network comprises a root grid and leaf grids;
acquiring, through the path planning layer, the path planning data of the path traversed from the root grid to each leaf grid in the fuzzy tree structure network;
calculating a path length of each leaf grid based on the path planning data by a path integration layer;
Determining an anomaly score of each leaf grid based on the path length of each leaf grid through an anomaly detection layer, and taking the leaf grid with the anomaly score larger than a dynamic anomaly threshold as an anomaly grid; the dynamic abnormal threshold is obtained through self-adaptive evaluation of real-time load conditions and/or node availability in the intelligent computing cloud operating system;
through an output layer, taking the real-time operation data stream corresponding to the abnormal grid as the abnormal operation data stream;
the method for circularly dividing the real-time operation data stream corresponding to the current branch module through the fuzzy tree structure construction layer to construct the fuzzy tree structure network corresponding to the current branch module comprises the following steps:
receiving a real-time operation data stream corresponding to a current branch module;
randomly selecting any axis as a dividing axis in an initial cycle period, and dividing the real-time operation data stream by the selected dividing axis to obtain two first sub-data streams;
In the second cycle, re-selecting another arbitrary axis as a dividing axis, and dividing the two first sub-data streams respectively by the selected dividing axis to obtain two second sub-data streams respectively corresponding to the two first sub-data streams;
repeating the cycle period, and cutting each sub-data stream until the cutting stopping condition is reached, so as to obtain the fuzzy tree structure network.
2. The distributed fault restoration method according to claim 1, wherein the randomly selecting an arbitrary axis as the split axis or the re-selecting another arbitrary axis as the split axis includes:
mapping the real-time operation data stream corresponding to the current branch module to a data characteristic group image space to form a real-time operation data characteristic group;
randomly selecting a segmentation position and a segmentation angle based on the distribution condition of data characteristic elements in the real-time operation data characteristic group, and recording the selected segmentation position and the segmentation angle to form a segmentation axis;
wherein the arbitrary axis is any one of a linear axis, a nonlinear axis and a multi-type combined axis in the data characteristic group image space.
3. The distributed fault repair method of claim 1, wherein the dynamic fault repair model comprises at least: an anomaly locating layer, an anomaly identifying layer and a dynamic repairing layer;
Summarizing the abnormal operation data stream through a dynamic fault repair model, and dynamically repairing the intelligent computing cloud operating system, wherein the method comprises the following steps:
Receiving the abnormal operation data stream in real time through the anomaly locating layer, and taking the data source node corresponding to the abnormal operation data stream as a candidate abnormal node;
Judging, through the anomaly identifying layer, whether the candidate abnormal node is an abnormal node based on the historical operation data of the candidate abnormal node, the network topology and the real-time processing data flow;
and starting the repair processing of the abnormal node in the intelligent computing cloud operating system through the dynamic repair layer so as to recover the abnormal node to a healthy running state.
4. The distributed fault repair method according to claim 3, wherein the step of starting repair processing for the abnormal node in the intelligent computing cloud operating system through the dynamic repair layer to restore the abnormal node to a healthy operating state comprises:
Judging the abnormal type of the abnormal node;
If the abnormal node is a redundant node, removing or isolating nodes whose anomaly frequency is higher than the set availability upper limit, based on a preset dynamic redundancy repair strategy; or
If the data in the abnormal node is abnormal, increasing the number of data copies of the abnormal node, and storing the data copies on the expanded backup nodes; a backup node is a node whose state is healthy and whose data access mode is healthy; or
If a plurality of adjacent abnormal nodes exist, the network topology structure of the local network where the abnormal nodes are located is optimized, or a collective isolation operation is carried out on the plurality of adjacent abnormal nodes, and a backup node cluster is established for taking over the data processing task of the abnormal nodes which are isolated in a collective manner.
5. An intelligent computing cloud operating system for running and managing intelligent platform resources and implementing the distributed fault repair method of claim 1, the intelligent computing cloud operating system comprising:
the acquisition unit is configured to acquire real-time operation data of the intelligent computing cloud operating system; the real-time operation data at least comprises: real-time node load information, node availability information and data access modes;
The stream segmentation unit is configured to perform stream segmentation processing on the real-time operation data to obtain a plurality of real-time operation data streams corresponding to the real-time operation data;
The detection unit is configured to input the plurality of real-time operation data streams in parallel into the branch modules of the streaming extensible anomaly detection model, the plurality of branch modules respectively executing dynamic fuzzy detection processing on their corresponding real-time operation data streams to obtain the abnormal operation data streams among the plurality of real-time operation data streams; wherein the plurality of branch modules are deployed in a distributed manner on different processing nodes of the intelligent computing cloud operating system;
and the repair unit is configured to summarize the abnormal operation data flow through a dynamic fault repair model and dynamically repair the intelligent computing cloud operating system.
6. An intelligent computing platform, the intelligent computing platform comprising:
At least one processor, memory, and input output unit;
Wherein the memory is for storing a computer program and the processor is for invoking the computer program stored in the memory to perform the distributed fault repair method of any one of claims 1 to 4.
CN202410416084.0A 2024-04-08 2024-04-08 Distributed fault restoration method, intelligent computing cloud operating system and computing platform Active CN118012662B (en)

Publications (2)

Publication Number Publication Date
CN118012662A (en) 2024-05-10
CN118012662B (en) 2024-06-18

Family

ID=90950249

Country Status (1)

Country Link
CN (1) CN118012662B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant