WO2021114613A1 - Artificial intelligence-based faulty node identification method, apparatus, device, and medium - Google Patents

Artificial intelligence-based faulty node identification method, apparatus, device, and medium

Info

Publication number
WO2021114613A1
WO2021114613A1 PCT/CN2020/098772 CN2020098772W WO2021114613A1 WO 2021114613 A1 WO2021114613 A1 WO 2021114613A1 CN 2020098772 W CN2020098772 W CN 2020098772W WO 2021114613 A1 WO2021114613 A1 WO 2021114613A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
alarm
data
nodes
faulty
Prior art date
Application number
PCT/CN2020/098772
Other languages
English (en)
French (fr)
Inventor
陈桢博
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021114613A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device, equipment and medium for identifying faulty nodes based on artificial intelligence.
  • a fault generated by a certain node may trigger multiple node alarms, and there may be a large number of associated alarms caused by multiple faulty nodes at each moment.
  • An artificial intelligence-based method, apparatus, device, and medium for identifying faulty nodes are provided.
  • a method for identifying faulty nodes based on artificial intelligence including:
  • the graph data includes multiple nodes in the faulty system and the calling relationships among multiple nodes;
  • the alarm nodes are grouped to obtain the alarm node combinations;
  • the fault node in each alarm node combination is determined.
  • An artificial intelligence-based fault node identification device including:
  • the graph data acquisition module is used to acquire graph data corresponding to the faulty system.
  • the graph data includes multiple nodes in the faulty system and the calling relationships among multiple nodes;
  • Node data acquisition module used to acquire node data of each node in the faulty system
  • the alarm node and initial detection result generation module is used to determine each alarm node in the faulty system according to the node data of each node and, according to the calling relationships between each alarm node and the multiple nodes, to obtain an initial detection result of each alarm node in the faulty system being a faulty node;
  • the alarm node combination determination module is used to group each alarm node according to the calling relationship between multiple nodes to obtain each alarm node combination;
  • the fault node determination module is used to determine the fault node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
  • A computer device including a memory and one or more processors, where the memory stores computer-readable instructions and, when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • the graph data includes multiple nodes in the faulty system and the calling relationships among multiple nodes;
  • the alarm nodes are grouped to obtain the alarm node combinations;
  • the fault node in each alarm node combination is determined.
  • One or more computer-readable storage media storing computer-readable instructions.
  • the one or more processors perform the following steps:
  • the graph data includes multiple nodes in the faulty system and the calling relationships among multiple nodes;
  • the alarm nodes are grouped to obtain the alarm node combinations;
  • the fault node in each alarm node combination is determined.
  • Fig. 1 is an application scenario diagram of a method for identifying faulty nodes based on artificial intelligence in one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for identifying faulty nodes based on artificial intelligence in one or more embodiments.
  • Fig. 3 is a schematic diagram of graph data according to one or more embodiments.
  • Fig. 4 is a schematic flowchart of a step of obtaining node data according to one or more embodiments.
  • Fig. 5 is a structural block diagram of an artificial intelligence-based fault node identification device according to one or more embodiments.
  • Fig. 6 is an internal structure diagram of a computer device according to one or more embodiments.
  • the artificial intelligence-based fault node identification method provided in this application can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • The server 104 obtains the graph data corresponding to the faulty system, where the graph data includes multiple nodes in the faulty system and the calling relationships among the multiple nodes, and then obtains the node data of each node in the faulty system.
  • According to the node data of each node, the server 104 determines each alarm node in the faulty system and, according to the calling relationships between each alarm node and the multiple nodes, obtains an initial detection result of each alarm node being a faulty node.
  • The server 104 then groups the alarm nodes according to the calling relationships among the multiple nodes to obtain the alarm node combinations, and determines the faulty node in each alarm node combination according to the alarm node combinations and the initial detection result of each alarm node. The server 104 then outputs the faulty node to the terminal 102, which displays it to the user.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • As shown in FIG. 2, a method for identifying faulty nodes based on artificial intelligence is provided. Taking the method as applied to the server in FIG. 1 as an example, the method includes the following steps:
  • Step S202: Obtain graph data corresponding to the faulty system.
  • The graph data includes multiple nodes in the faulty system and the calling relationships among the multiple nodes.
  • A faulty system is a system that exhibits alarm phenomena or has raised alarms.
  • Graph data refers to data that contains the multiple nodes of a faulty system and the calling relationships between those nodes.
  • The graph data can be represented by an adjacency matrix, that is, by an n*n matrix, where n is the number of nodes. In the adjacency matrix, the matrix element for two nodes is 1 if there is a calling relationship between them and 0 if there is not.
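  • As an illustration only, the adjacency-matrix representation described above can be sketched in Python. The function name `build_adjacency_matrix`, the five-node topology, and the choice to treat a calling relationship as symmetric are assumptions made for this sketch and are not specified by the application.

```python
def build_adjacency_matrix(nodes, calls):
    """Return an n*n matrix whose element is 1 where two nodes call each other."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for caller, callee in calls:
        i, j = index[caller], index[callee]
        # Assumption: a calling relationship is recorded in both directions.
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Hypothetical system with five nodes and four calling relationships.
nodes = ["A", "B", "C", "D", "E"]
calls = [("A", "B"), ("A", "E"), ("B", "C"), ("C", "D")]
adjacency = build_adjacency_matrix(nodes, calls)
```

  • A directed variant would simply drop the second assignment; the text does not say which form is intended.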
  • Nodes are the processing nodes in the system's data processing flow, such as host nodes and network nodes. Nodes have calling relationships with one another.
  • An alarm message can be generated and sent to the alarm system to warn of the failure.
  • The server can periodically process the alarm information received by the alarm system, determine the corresponding faulty system according to the alarm number or system number in the alarm information, and then obtain from the faulty system the corresponding graph data containing the calling relationships between nodes for subsequent processing.
  • Step S204: Obtain node data of each node in the faulty system.
  • Node data refers to the data corresponding to each node in the faulty system.
  • the node data may include node alarm data and node basic data.
  • Node alarm data may include, but is not limited to, data such as alarm type, alarm frequency, and alarm level.
  • the basic data of the node may include the type of the node, such as the host or the network, and the node level information, such as the level of the node calling relationship.
  • the server may correspondingly obtain the node data of each node after obtaining the graph data.
  • Step S206: Determine each alarm node in the faulty system according to the node data of each node, and obtain an initial detection result of each alarm node in the faulty system being a faulty node according to the calling relationships between each alarm node and the multiple nodes.
  • An alarm node is a node that raises an alarm when a fault occurs in the system.
  • A faulty node is the node that caused the fault.
  • An alarm node is not necessarily a faulty node; it may be an associated node that has a calling relationship with the faulty node.
  • The server can determine whether each node is an alarm node based on the node data of that node, for example based on the node alarm data in the node data.
  • The server then determines, according to the calling relationships between the nodes, the initial detection result of each alarm node being a faulty node.
  • The initial detection result can be a probability value, that is, the server can determine the probability that an alarm node is a faulty node according to the calling relationships between that alarm node and other nodes. For example, if an alarm node has calling relationships with multiple other nodes and those nodes are all non-alarm nodes, the probability that the alarm node is a faulty node is higher.
  • Step S208: Group the alarm nodes according to the calling relationships among the multiple nodes to obtain the alarm node combinations.
  • An alarm node combination is the set of nodes that raise alarms for the same fault. For example, if node A, node B, and node C all raise an alarm for fault a, then node A, node B, and node C can be placed in one alarm node combination.
  • The server can group the alarm nodes according to the calling relationships between the nodes in the graph data and obtain the alarm node combination corresponding to each fault.
  • Step S210: Determine the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
  • After the server determines the alarm node combinations and the initial detection result of each alarm node, it can determine, based on the initial detection results of the multiple alarm nodes in an alarm node combination, the probability value of each alarm node being the faulty node that caused the fault.
  • the server may sort the probability values of each alarm node, such as sorting in descending order, and determine the faulty node in the alarm node combination according to the sorting result.
  • the server may directly determine the alarm node with the largest probability value as the faulty node based on the probability value.
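  • A minimal sketch of this selection step, assuming the initial detection results are available as a mapping from node name to probability. The function name `pick_faulty_nodes` and the probability values are invented for illustration.

```python
def pick_faulty_nodes(combinations, probabilities):
    """For each alarm node combination, return the node most likely to be faulty."""
    faulty = []
    for combination in combinations:
        # Sort the alarm nodes of one combination by probability, highest first.
        ranked = sorted(combination, key=lambda node: probabilities[node], reverse=True)
        faulty.append(ranked[0])
    return faulty


# Hypothetical combinations and initial detection results.
combinations = [["A", "E"], ["C"]]
probabilities = {"A": 0.82, "E": 0.31, "C": 0.64}
faulty_nodes = pick_faulty_nodes(combinations, probabilities)
```

  • Keeping the full ranking rather than only the top entry would also allow the next most likely candidates to be reported.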
  • In the above method, the graph data includes multiple nodes in the faulty system and the calling relationships among them. The node data of each node in the faulty system is then obtained, each alarm node is determined according to the node data, and the initial detection result of each alarm node being a faulty node is obtained according to the calling relationships between each alarm node and the multiple nodes. The alarm nodes are then grouped according to the calling relationships among the multiple nodes to obtain the alarm node combinations, and the faulty node in each alarm node combination is determined according to the alarm node combinations and the initial detection result of each alarm node.
  • Because the alarm node combination corresponding to each fault can be determined from the graph data containing the node calling relationships and the node data of each node, and the faulty node can then be determined within each alarm node combination from the initial detection results of the alarm nodes, the identification of faulty nodes is more automated than manually querying and screening for them, which raises the level of intelligence of the data processing.
  • Determining each alarm node in the faulty system according to the node data of each node may include: extracting feature data from the node data of each node to obtain the node feature corresponding to each node; determining the node standard feature of each node, where the node standard feature is the feature extracted from the node data of the node in the non-alarm state; and matching the node standard feature against the node feature of each node to obtain each alarm node in the faulty system.
  • the node feature refers to the feature corresponding to each node.
  • the node feature can correspond to the node one-to-one. If the node is different, the corresponding node feature is different, such as the node feature corresponding to the host node, the node feature corresponding to the network node, etc.
  • The node standard feature is the feature extracted from the node data of the node in the non-alarm state, and can include features extracted from node data collected while the node is operating normally or operating within the allowable error range.
  • The server can extract the corresponding node features from the node data (for a host, for example, feature data such as the host alarm type, host alarm frequency, and host alarm level) and then match them against the node standard features of that node to determine whether the node is an alarm node. For example, if the alarm frequency in the node standard feature is twice an hour (understood as alarms within the normal error range) and the alarm frequency in the node feature extracted from the node data is once every 5 minutes, the node can be determined to be an alarm node through matching.
  • The server can also perform a weighted summation after matching multiple node features, and then determine whether the node is an alarm node based on the weighted summation result and a preset threshold. For example, if the preset threshold is 0.5, the node can be determined to be an alarm node when the weighted summation result is greater than 0.5, and a non-alarm node when the result is less than or equal to 0.5.
  • Because the alarm node is determined in this way, whether each node is an alarm node can be decided from the real-time node data of that node. This improves the accuracy of alarm node determination, which in turn makes the identification of the faulty node more accurate.
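  • The weighted-summation variant can be sketched as follows. How a feature match is turned into a number is not stated in the text, so this sketch assumes each feature yields a mismatch score between 0 (matches the standard feature) and 1 (deviates fully); the weights and feature names are likewise hypothetical.

```python
def is_alarm_node(mismatch_scores, weights, threshold=0.5):
    """Return True when the weighted sum of mismatch scores exceeds the threshold."""
    total = sum(weights[name] * mismatch_scores[name] for name in weights)
    # Strictly greater than the threshold means alarm; equal or less means non-alarm.
    return total > threshold


# Hypothetical weights for three host features.
weights = {"alarm_type": 0.2, "alarm_frequency": 0.5, "alarm_level": 0.3}

# Frequency deviates fully and level deviates partly: 0.5 + 0.15 = 0.65.
noisy_host = {"alarm_type": 0.0, "alarm_frequency": 1.0, "alarm_level": 0.5}
# Only the frequency deviates: the sum is exactly 0.5, so not an alarm node.
borderline_host = {"alarm_type": 0.0, "alarm_frequency": 1.0, "alarm_level": 0.0}

noisy_is_alarm = is_alarm_node(noisy_host, weights)
borderline_is_alarm = is_alarm_node(borderline_host, weights)
```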
  • Grouping the alarm nodes according to the calling relationships among the multiple nodes to obtain the alarm node combinations may include: determining the node distance between any two alarm nodes according to the calling relationships among the multiple nodes in the faulty system; taking any alarm node as the starting alarm node, determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node, continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.
  • The node distance is the distance between two alarm nodes.
  • The node distance varies with the number of non-alarm nodes between the two alarm nodes.
  • For example, node A has calling relationships with node B and node E, node B has a calling relationship with node C, and node C has a calling relationship with node D; node A, node C, and node E are all alarm nodes, and node B and node D are normal nodes. The server can then determine that the node distance between node A and node C is 2 (they are separated by the non-alarm node B) and that the node distance between node A and node E is 1.
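  • One way to reproduce the distances in this example is to count call hops on the shortest path between two nodes with a breadth-first search. The text does not define the distance formally, so the hop-count interpretation and the function name `node_distance` are assumptions of this sketch.

```python
from collections import deque


def node_distance(calls, start, goal):
    """Return the number of call hops on the shortest path, or None if unreachable."""
    neighbours = {}
    for a, b in calls:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Calling relationships from the example: A-B, A-E, B-C, C-D.
calls = [("A", "B"), ("A", "E"), ("B", "C"), ("C", "D")]
distance_a_c = node_distance(calls, "A", "C")
distance_a_e = node_distance(calls, "A", "E")
```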
  • The server may set the node threshold distance to 1, that is, set the node threshold for grouping alarm nodes to 1. The server then takes any alarm node as the starting alarm node, for example node A, and finds the associated alarm nodes whose node distance is less than or equal to 1; here the associated alarm node of node A is node E. From the starting alarm node and the associated alarm node so determined, the server obtains the alarm node combination consisting of node A and node E.
  • the alarm node combination can also be expressed as an alarm node cluster.
  • the alarm node combination node A and node E can be expressed as an alarm node cluster [A, E].
  • When the server sets the node threshold distance to 2 and again takes node A as the starting alarm node, it finds the associated alarm nodes whose node distance is less than or equal to 2, so both node E and node C are associated alarm nodes of node A. The server then obtains the alarm node combination consisting of node A, node C, and node E, which can also be expressed as the alarm node cluster [A, C, E].
  • In another case, node A, node B, node C, node D, and node E are all alarm nodes, and the server sets the node threshold distance to 1.
  • The server again takes node A as the starting alarm node. According to the node threshold distance, it first determines that the associated alarm nodes are node B and node E. The server then takes node B as the starting alarm node and determines that its associated alarm node is node C.
  • The server can further determine that the associated alarm node of node C is node D, so the alarm node combination consists of node A, node B, node C, node D, and node E, which can also be expressed as the alarm node cluster [A, B, C, D, E].
  • Because the associated alarm nodes corresponding to the starting alarm node are determined from the node distance and the alarm node combination is generated from them, the associated alarm nodes can be determined accurately, which improves the accuracy of associated alarm node determination.
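  • The grouping procedure can be sketched as a repeated expansion from a starting alarm node. This is an illustrative reading of the steps above, not the application's own implementation: node distance is assumed to be the hop count along calling relationships, and the names `hop_distances` and `group_alarm_nodes` are invented here.

```python
from collections import deque


def hop_distances(neighbours, start):
    """Breadth-first search returning the hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def group_alarm_nodes(calls, alarm_nodes, threshold):
    """Group alarm nodes whose distance to a group member is within the threshold."""
    neighbours = {}
    for a, b in calls:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    unassigned = sorted(alarm_nodes)
    groups = []
    while unassigned:
        start = unassigned.pop(0)
        group, frontier = [start], [start]
        while frontier:
            # Each associated alarm node becomes a starting alarm node in turn.
            current = frontier.pop()
            dist = hop_distances(neighbours, current)
            for other in list(unassigned):
                if dist.get(other, float("inf")) <= threshold:
                    unassigned.remove(other)
                    group.append(other)
                    frontier.append(other)
        groups.append(sorted(group))
    return groups


calls = [("A", "B"), ("A", "E"), ("B", "C"), ("C", "D")]
clusters_t1 = group_alarm_nodes(calls, {"A", "C", "E"}, threshold=1)
clusters_t2 = group_alarm_nodes(calls, {"A", "C", "E"}, threshold=2)
clusters_all = group_alarm_nodes(calls, {"A", "B", "C", "D", "E"}, threshold=1)
```

  • With the example topology, a threshold of 1 leaves node C in a combination of its own, a threshold of 2 merges A, C, and E, and marking every node as an alarm node chains all five together.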
  • obtaining node data of each node in the faulty system may include:
  • Step S402: Obtain the original alarm data of the faulty system collected by the alarm system.
  • Original alarm data is the data obtained directly from the alarm system, such as the alarm record data in the alarm system.
  • Step S404: Node alarm data is extracted from the original alarm data to obtain the node alarm data of each node.
  • the node alarm data may include at least one of the alarm type, alarm frequency, and alarm level of each node.
  • the node alarm data can include but is not limited to data such as alarm type, alarm frequency, and alarm level.
  • The server may extract node alarm data from the obtained original alarm data.
  • the server may extract node alarm data from the original alarm data according to a preset extraction template.
  • different types of nodes may have different extraction templates, and the extracted node alarm data may be different.
  • the server may also analyze and process the extracted data to obtain node alarm data corresponding to each node. For example, after the alarm data of the host is obtained from the alarm record, the alarm frequency of the host is obtained by statistical analysis of the number of alarms of the host.
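  • For instance, deriving an alarm frequency from alarm records can be sketched as a simple count per node over a time window. The record fields (`node`, `type`, `level`) and the per-hour unit are assumptions for illustration; the text does not describe the record layout.

```python
from collections import Counter


def alarm_frequency(records, window_hours):
    """Count alarms per node and convert the count to alarms per hour."""
    counts = Counter(record["node"] for record in records)
    return {node: count / window_hours for node, count in counts.items()}


# Hypothetical alarm records collected over a one-hour window.
records = [
    {"node": "host-1", "type": "cpu", "level": 2},
    {"node": "host-1", "type": "cpu", "level": 3},
    {"node": "net-1", "type": "latency", "level": 1},
]
frequency = alarm_frequency(records, window_hours=1)
```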
  • Step S406: Obtain node basic data of each node.
  • the node basic data may include at least one of a node type and a node hierarchy.
  • the server can directly obtain the node basic data of each node from the database of the faulty system.
  • Step S408: Node data of each node is generated according to the node alarm data and the node basic data of each node.
  • the server can combine the data of the same node to obtain node data corresponding to each node.
  • Because the node alarm data is obtained from the original alarm data, the node basic data is obtained, and the node data is then generated from both, the node data contains the various features of the node. This makes the subsequent determination of alarm nodes more accurate, which in turn improves the accuracy of identifying faulty nodes.
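  • Combining the two sources per node might look like the following sketch. The dictionary layout and the function name `build_node_data` are assumptions; the text only states that data of the same node are combined.

```python
def build_node_data(alarm_data, basic_data):
    """Merge node basic data and node alarm data into one record per node."""
    node_data = {}
    for node, basic in basic_data.items():
        merged = dict(basic)
        # Nodes without any alarm record keep only their basic data.
        merged.update(alarm_data.get(node, {}))
        node_data[node] = merged
    return node_data


# Hypothetical inputs for two nodes, one of which has raised alarms.
basic_data = {
    "host-1": {"node_type": "host", "node_level": 2},
    "net-1": {"node_type": "network", "node_level": 1},
}
alarm_data = {"host-1": {"alarm_type": "cpu", "alarm_frequency": 12.0, "alarm_level": 3}}
node_data = build_node_data(alarm_data, basic_data)
```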
  • Determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of each alarm node being a faulty node according to the calling relationships between each alarm node and the multiple nodes, can be performed by a pre-trained graph convolutional neural network model.
  • The training method of the graph convolutional neural network model can include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which each node is labeled as an alarm node, a non-alarm node, a faulty node, or a non-faulty node; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of each node being a faulty node or a non-faulty node; determining the loss value of the initial model based on the prediction results and the labeled training graph data, and updating the model parameters of the initial model through the loss value; and iteratively processing the initial model to obtain the trained graph convolutional neural network model.
  • the server may use graph data and historical node data of different systems as training sample data. Then the server uses the labeling tool to label each node in the training graph data according to the node training data of each node.
  • LabelImg can be used to label alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes respectively.
  • The server may perform normalization processing on the training graph data to obtain normalized training graph data.
  • The server inputs the normalized training graph data and the node training data into the constructed initial graph convolutional neural network model, uses the initial model to extract node features, and determines the alarm nodes and non-alarm nodes based on the extracted node features.
  • the server may perform quantization processing on the node training data of each node, such as quantizing the alarm type, alarm level, etc., to obtain the quantized node training data.
  • the server performs node prediction based on the determined alarm node, non-alarm node, node training data of each node, and the calling relationship between the nodes in the training graph data, and obtains the prediction result that each node is a faulty node.
  • the graph neural network model can predict the probability that each node is a faulty node through a calculation formula.
  • the specific calculation formula is as follows:
  • the server may calculate the loss value of the initial graph convolutional neural network model through the loss function according to the predicted result and the marked result.
  • the loss value of the model is calculated by the cross-entropy loss function, or it can also be the L1 loss function and/or the L2 loss function, etc., which is not limited.
  • the server can iteratively process the initial graph convolutional neural network model according to the preset learning rate and the calculated loss value, and continuously update the parameters of the model to obtain the trained graph convolutional neural network model.
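  • The overall loop of prediction, cross-entropy loss, and parameter update can be illustrated with a deliberately simplified stand-in: a single mean-aggregation graph convolution followed by a logistic output, trained by plain gradient descent. This is not the model or formula of the application; the features, labels, learning rate, and all function names are assumptions chosen to keep the sketch self-contained.

```python
import math


def aggregate(adjacency, features):
    """One graph-convolution step: average each node's features with its neighbours'."""
    n = len(adjacency)
    out = []
    for i in range(n):
        members = [i] + [j for j in range(n) if adjacency[i][j] and j != i]
        out.append([sum(features[j][k] for j in members) / len(members)
                    for k in range(len(features[0]))])
    return out


def predict(hidden, weights, bias):
    """Logistic output: probability that each node is a faulty node."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, row)) + bias)))
            for row in hidden]


def cross_entropy(probs, labels):
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)


def train(adjacency, features, labels, learning_rate=0.5, epochs=200):
    hidden = aggregate(adjacency, features)
    weights, bias = [0.0] * len(features[0]), 0.0
    losses = []
    for _ in range(epochs):
        probs = predict(hidden, weights, bias)
        losses.append(cross_entropy(probs, labels))
        # Gradient of the cross-entropy loss with respect to the logits.
        errors = [p - y for p, y in zip(probs, labels)]
        for k in range(len(weights)):
            grad = sum(e * row[k] for e, row in zip(errors, hidden)) / len(labels)
            weights[k] -= learning_rate * grad
        bias -= learning_rate * sum(errors) / len(labels)
    return weights, bias, losses


# Hypothetical graph (A-B, A-E, B-C, C-D), features [alarm flag, alarm level], labels.
adjacency = [[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
             [0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
features = [[1, 0.9], [0, 0.0], [1, 0.4], [0, 0.0], [1, 0.3]]
labels = [1, 0, 0, 0, 0]
weights, bias, losses = train(adjacency, features, labels)
```

  • A practical model would stack several layers with non-linearities and normally be written with a deep learning library; the sketch only shows how the loss value drives the iterative parameter updates.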
  • Because each alarm node in the faulty system is determined by the trained graph convolutional neural network model, and the initial detection result of each alarm node being a faulty node is obtained by the same model, the accuracy of alarm node identification and of the initial detection result can be improved, which further improves the accuracy of determining the faulty node.
  • At least one of the graph data and the node data is uploaded to the blockchain and stored in the nodes of the blockchain.
  • Blockchain refers to a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database. It is a series of data blocks linked to one another by cryptographic methods, and each data block contains a batch of network transaction information that is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain may include the underlying blockchain platform, the platform product service layer, and the application service layer.
  • the server can upload and store one or more of the graph data and node data in the nodes of the blockchain to ensure the privacy and security of the data.
  • The privacy of the data stored in the nodes of the blockchain can be guaranteed, and the security of the data can be improved.
  • An artificial intelligence-based fault node identification device is provided, which includes: a graph data acquisition module 100, a node data acquisition module 200, an alarm node and initial detection result generation module 300, an alarm node combination determination module 400, and a fault node determination module 500, wherein:
  • the graph data acquisition module 100 is used to acquire graph data corresponding to a faulty system.
  • the graph data includes multiple nodes in the faulty system and call relationships between multiple nodes.
  • the node data obtaining module 200 is used to obtain node data of each node in the faulty system.
  • The alarm node and initial detection result generation module 300 is used to determine each alarm node in the faulty system according to the node data of each node and, according to the calling relationships between each alarm node and the multiple nodes, to obtain the initial detection result of each alarm node in the faulty system being a faulty node.
  • the alarm node combination determination module 400 is used for grouping the alarm nodes according to the calling relationship between multiple nodes to obtain the alarm node combinations.
  • the fault node determination module 500 is used to determine the fault node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
  • the alarm node and the initial detection result generation module 300 may include:
  • the extraction sub-module is used to extract the feature data of each node data to obtain the node feature corresponding to each node.
  • the node standard feature determination sub-module is used to determine the node standard feature of each node.
  • the node standard feature is the feature extracted based on the node data of the node in the non-alarm state.
  • the matching sub-module is used to match the node standard features and node features of each node to obtain each alarm node in the fault system.
  • the alarm node combination determination module 400 may include:
  • the node distance determination sub-module is used to determine the node distance between any two alarm nodes according to the calling relationship between multiple nodes in the fault system.
  • the associated fault node determination sub-module is used to use any alarm node as the initial alarm node to determine the associated alarm node whose node distance from the initial alarm node is less than or equal to the node threshold distance.
  • the cyclic sub-module is used to use the associated alarm node as the initial alarm node, and continue to determine the associated alarm node whose node distance from the associated alarm node is less than or equal to the node threshold distance.
  • The alarm node combination determination sub-module is used to place the initial alarm node and the corresponding associated alarm nodes in the same alarm node combination.
  • the node data acquisition module 200 may include:
  • the alarm raw data acquisition sub-module is used to obtain the alarm raw data of the fault system collected by the alarm system.
  • the node alarm data generation sub-module is used to extract the node alarm data from the original alarm data to obtain the node alarm data of each node.
  • the node alarm data includes at least one of the alarm type, alarm frequency and alarm level of each node.
  • the node basic data acquisition sub-module is used to acquire the node basic data of each node, and the node basic data includes at least one of a node type and a node level.
  • the node data generation sub-module is used to generate the node data of each node according to the alarm data of each node and the basic data of the node.
  • The operations of the alarm node and initial detection result generation module 300, namely determining each alarm node in the faulty system according to the node data of each node and obtaining, according to the calling relationships between each alarm node and the multiple nodes, the initial detection result of each alarm node being a faulty node, can be performed by a pre-trained graph convolutional neural network model.
  • In this embodiment, the above device may further include a model training module for training the graph convolutional neural network model.
  • In this embodiment, the model training module may include:
  • The training sample data acquisition sub-module is configured to acquire training sample data. The training sample data includes training graph data and node training data of each node.
  • The labeling sub-module is configured to label each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes.
  • The feature extraction sub-module is configured to input the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and to perform feature extraction on the training sample data through the initial model to obtain feature data.
  • The regression prediction sub-module is configured to perform regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node.
  • The loss calculation sub-module is configured to determine the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and to update the model parameters of the initial model using the loss value.
  • The iterative processing sub-module is configured to iterate on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.
  • In one embodiment, the above device may further include:
  • The upload and storage module is configured to upload at least one of the graph data and the node data to a blockchain and store it in the nodes of the blockchain.
  • Each module in the above artificial intelligence-based fault node identification device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call them and execute the operations corresponding to each module.
  • In one embodiment, a computer device is provided.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 6.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile or volatile storage medium and an internal memory.
  • The non-volatile or volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • The internal memory provides an environment for running the operating system and the computer-readable instructions held in the non-volatile storage medium.
  • The database of the computer device stores graph data, node data, and other data.
  • The network interface of the computer device communicates with an external terminal through a network connection.
  • FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied.
  • A specific computer device may include more or fewer components than shown in the figure, may combine some components, or may have a different arrangement of components.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes; obtaining node data of each node in the faulty system; determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node; grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
  • In one embodiment, determining each alarm node in the faulty system according to the node data of each node, as implemented when the processor executes the computer-readable instructions, may include: extracting feature data from the node data to obtain the node features corresponding to each node; determining the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and matching the node standard features of each node against its node features to obtain each alarm node in the faulty system.
  • In one embodiment, grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations, as implemented when the processor executes the computer-readable instructions, may include: determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system; taking any alarm node as the starting alarm node and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node and continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.
  • In one embodiment, obtaining the node data of each node in the faulty system, as implemented when the processor executes the computer-readable instructions, may include: obtaining the raw alarm data of the faulty system collected by the alarm system; extracting node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node; obtaining the node basic data of each node, the node basic data including at least one of a node type and a node level; and generating the node data of each node according to the node alarm data and the node basic data.
  • In one embodiment, when the processor executes the computer-readable instructions, determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of whether each alarm node is a faulty node according to each alarm node and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model.
  • The training of the graph convolutional neural network model may include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node; determining the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial model using the loss value; and iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.
  • In one embodiment, the processor further implements the following step when executing the computer-readable instructions: uploading at least one of the graph data and the node data to a blockchain and storing it in the nodes of the blockchain.
  • One or more computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes; obtaining node data of each node in the faulty system; determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node; grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
  • The computer-readable storage medium may be non-volatile or volatile.
  • In one embodiment, determining each alarm node in the faulty system according to the node data of each node, as implemented when the computer-readable instructions are executed by the processor, may include: extracting feature data from the node data to obtain the node features corresponding to each node; determining the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and matching the node standard features of each node against its node features to obtain each alarm node in the faulty system.
  • In one embodiment, grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations, as implemented when the computer-readable instructions are executed by the processor, may include: determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system; taking any alarm node as the starting alarm node and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node and continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.
  • In one embodiment, obtaining the node data of each node in the faulty system, as implemented when the computer-readable instructions are executed by the processor, may include: obtaining the raw alarm data of the faulty system collected by the alarm system; extracting node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node; obtaining the node basic data of each node, the node basic data including at least one of a node type and a node level; and generating the node data of each node according to the node alarm data and the node basic data.
  • In one embodiment, when the computer-readable instructions are executed by the processor, determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of whether each alarm node is a faulty node according to each alarm node and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model.
  • The training of the graph convolutional neural network model may include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node; determining the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial model using the loss value; and iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.
  • In one embodiment, the following step is further implemented when the computer-readable instructions are executed by the processor: uploading at least one of the graph data and the node data to a blockchain and storing it in the nodes of the blockchain.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

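An artificial intelligence-based fault node identification method, relating to the field of artificial intelligence, includes: obtaining graph data corresponding to a faulty system, the graph data including the calling relationships between multiple nodes in the faulty system; obtaining node data of each node in the faulty system; determining each alarm node according to the node data of each node, and obtaining, according to each alarm node and the calling relationships, an initial detection result of whether each alarm node in the faulty system is a faulty node; grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node. The present application also relates to blockchain technology: the graph data, the node data, and similar data can all be stored in a blockchain.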

Description

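Artificial Intelligence-Based Fault Node Identification Method, Apparatus, Device, and Medium
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 202010517479.1, filed with the China Patent Office on June 9, 2020 and titled "Artificial Intelligence-Based Fault Node Identification Method, Apparatus, Device, and Medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to an artificial intelligence-based fault node identification method, apparatus, device, and medium.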
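Background
In an operations and maintenance system, a fault in one node may trigger alarms on multiple nodes, and at any given moment there may be a large number of correlated alarms triggered by several faulty nodes.
Traditionally, operations personnel locate, for each fault, the faulty node that caused it so that the faulty node can be repaired promptly.
However, the inventors realized that when operations personnel manually search for the multiple faulty nodes behind correlated alarms, they must start the analysis from a large amount of raw data, and the analysis process is not sufficiently intelligent.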
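Summary
According to the various embodiments disclosed in this application, an artificial intelligence-based fault node identification method, apparatus, device, and medium are provided.
An artificial intelligence-based fault node identification method includes:
obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
obtaining node data of each node in the faulty system;
determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.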
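An artificial intelligence-based fault node identification apparatus includes:
a graph data acquisition module, configured to obtain graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
a node data acquisition module, configured to obtain node data of each node in the faulty system;
an alarm node and initial detection result generation module, configured to determine each alarm node in the faulty system according to the node data of each node, and to obtain, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
an alarm node combination determination module, configured to group the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
a faulty node determination module, configured to determine the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.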
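A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processors, cause the one or more processors to perform the following steps:
obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
obtaining node data of each node in the faulty system;
determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.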
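One or more computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
obtaining node data of each node in the faulty system;
determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
The details of one or more embodiments of this application are set out in the drawings and description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.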
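Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an application scenario diagram of the artificial intelligence-based fault node identification method according to one or more embodiments.
FIG. 2 is a schematic flowchart of the artificial intelligence-based fault node identification method according to one or more embodiments.
FIG. 3 is a schematic diagram of graph data according to one or more embodiments.
FIG. 4 is a schematic flowchart of the step of obtaining node data according to one or more embodiments.
FIG. 5 is a structural block diagram of the artificial intelligence-based fault node identification apparatus according to one or more embodiments.
FIG. 6 is an internal structure diagram of a computer device according to one or more embodiments.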
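Detailed Description
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and not to limit it.
The artificial intelligence-based fault node identification method provided by this application can be applied in the environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. Specifically, the server 104 obtains graph data corresponding to a faulty system, where the graph data includes multiple nodes in the faulty system and the calling relationships between the multiple nodes. The server 104 then obtains node data of each node in the faulty system, determines each alarm node in the faulty system according to the node data, and obtains, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node is a faulty node. Next, the server 104 groups the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations, and determines the faulty node in each alarm node combination according to each combination and the initial detection result of each alarm node. The server 104 then outputs the faulty nodes to the terminal 102, which displays them to the user. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device. The server 104 may be implemented as a standalone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a schematic flowchart of an artificial intelligence-based fault node identification method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps: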
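Step S202: obtain graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes.
A faulty system is a system in which an alarm condition or alarm warning is present.
Graph data is data that contains the multiple nodes of the faulty system and the calling relationships between those nodes. In this embodiment, the graph data can be represented by an adjacency matrix, that is, an n*n matrix where n is the number of nodes. In the adjacency matrix, the element for a pair of nodes is 1 if a calling relationship exists between them and 0 otherwise.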
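The adjacency-matrix representation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_adjacency`, the node names, and the edge list are assumptions, and the matrix is filled symmetrically because the text only says that a calling relationship "exists" between two nodes.

```python
from typing import Dict, Iterable, List, Tuple


def build_adjacency(nodes: List[str], calls: Iterable[Tuple[str, str]]) -> List[List[int]]:
    """Return an n*n 0/1 matrix; entry [i][j] is 1 when nodes i and j have a calling relationship."""
    index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for caller, callee in calls:
        i, j = index[caller], index[callee]
        # Assumption: the relationship is treated as undirected. A directed
        # variant would set only adj[i][j].
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


# Hypothetical five-node system: A calls B and E, B calls C, C calls D.
NODES = ["A", "B", "C", "D", "E"]
CALLS = [("A", "B"), ("A", "E"), ("B", "C"), ("C", "D")]
ADJ = build_adjacency(NODES, CALLS)
```

With four call edges stored symmetrically, the matrix holds eight non-zero entries and a zero diagonal.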
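A node is a processing node in the system's data processing flow, such as a host node or a network node. Nodes have calling relationships with one another.
In this embodiment, when a service system experiences a system fault, it can generate alarm information and send it to the alarm system to warn of the fault. The server can periodically process the alarm information received by the alarm system, determine the corresponding faulty system from the alarm number or system number in the alarm information, obtain from the faulty system the graph data containing the calling relationships between nodes, and carry out the subsequent processing.
Step S204: obtain node data of each node in the faulty system.
Node data is the data corresponding to each node in the faulty system. It may include node alarm data and node basic data. Node alarm data may include, but is not limited to, alarm type, alarm frequency, and alarm level. Node basic data may include the type of the node, such as host or network, and node level information, such as the node's level in the calling hierarchy.
In this embodiment, after obtaining the graph data, the server can obtain the node data of each corresponding node.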
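Step S206: determine each alarm node in the faulty system according to the node data of each node, and obtain, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node.
An alarm node is a node that raises an alarm about a fault when a fault exists in the system. A faulty node is the node that caused the fault. In this embodiment, an alarm node is not necessarily a faulty node; it may be an associated node that has a calling relationship with the faulty node.
In this embodiment, the server can determine whether each node is an alarm node according to its node data, for example according to the node alarm data contained in the node data.
The server then determines, according to the calling relationships between nodes, the initial detection result of whether each of the alarm nodes is a faulty node.
In this embodiment, the initial detection result may be a probability value: the server can determine the probability that an alarm node is a faulty node according to the calling relationships between that alarm node and other nodes. For example, if an alarm node has calling relationships with several other nodes and all of those nodes are non-alarm nodes, the probability that the alarm node is a faulty node is relatively high.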
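Step S208: group the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations.
An alarm node combination is a set of nodes that raise alarms about the same fault. For example, if node A, node B, and node C all raise alarms about fault a, then node A, node B, and node C can be placed in one alarm node combination.
In this embodiment, the server can group the alarm nodes according to the calling relationships between nodes in the graph data to obtain the alarm node combination corresponding to each fault.
Step S210: determine the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
In this embodiment, after determining the alarm node combinations and the initial detection result of each alarm node, the server can determine the faulty node that caused the fault according to the initial detection results of the alarm nodes in a combination, for example the probability that each alarm node is a faulty node.
Specifically, the server can sort the probability values of the alarm nodes, for example in descending order, and determine the faulty node in the alarm node combination from the sorted result. Alternatively, the server can directly take the alarm node with the largest probability value as the faulty node.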
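The selection in step S210 can be sketched as below, assuming the initial detection result is a per-node probability. The function names and the probability values are illustrative assumptions; the text does not say how ties are resolved, and here `max` simply keeps the first node encountered.

```python
from typing import Dict, List


def rank_group(group: List[str], fault_prob: Dict[str, float]) -> List[str]:
    """Sort the alarm nodes of one combination by descending fault probability."""
    return sorted(group, key=lambda node: fault_prob[node], reverse=True)


def pick_fault_nodes(groups: List[List[str]], fault_prob: Dict[str, float]) -> List[str]:
    """Return, for each alarm node combination, the node with the largest probability."""
    return [max(group, key=lambda node: fault_prob[node]) for group in groups]


# Hypothetical initial detection results for three alarm nodes.
PROBS = {"A": 0.8, "C": 0.6, "E": 0.3}
FAULTS = pick_fault_nodes([["A", "E"], ["C"]], PROBS)
```

Both variants described in the text give the same answer for a single faulty node per combination; the sorted form is useful when operators want a ranked shortlist.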
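In the above artificial intelligence-based fault node identification method, graph data corresponding to the faulty system is obtained, the graph data including multiple nodes in the faulty system and the calling relationships between them. Node data of each node in the faulty system is then obtained, each alarm node in the faulty system is determined according to the node data, and an initial detection result of whether each alarm node is a faulty node is obtained according to each alarm node and the calling relationships between the multiple nodes. The alarm nodes are then grouped according to the calling relationships between the multiple nodes to obtain alarm node combinations, and the faulty node in each combination is determined according to the combination and the initial detection result of each alarm node. In this way, alarm node combinations corresponding to each fault can be determined from the graph data containing the calling relationships and from the node data of each node, and the faulty node can then be determined from each combination according to the initial detection results of the alarm nodes. Compared with manually querying and screening faulty nodes, this makes the identification of faulty nodes more intelligent and raises the level of intelligence in data processing.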
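In one embodiment, determining each alarm node in the faulty system according to the node data of each node may include: extracting feature data from the node data to obtain the node features corresponding to each node; determining the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and matching the node standard features of each node against its node features to obtain each alarm node in the faulty system.
Node features are the features corresponding to each node. Node features may correspond one-to-one with nodes, so different nodes have different node features, for example the node features of a host node or the node features of a network node.
Node standard features are features extracted from the node data of a node in a non-alarm state. They may include features extracted from node data recorded while the node runs normally or runs within an allowed error range.
In this embodiment, the server can extract the corresponding node features from the node data, for example, for a host, feature data such as the host alarm type, host alarm frequency, and host alarm level, and then match them against the node standard features of the node to determine whether the node is an alarm node. For example, if the alarm frequency in the node standard features is 2 per hour (understood as alarms within the normal error range) and the node feature extracted from the node data is 1 every 5 minutes, the matching determines that the node is an alarm node.
Alternatively, the server can match multiple node features, compute a weighted sum of the results, and determine whether the node is an alarm node according to the weighted sum and a preset threshold. For example, with a preset threshold of 0.5, a node is determined to be an alarm node if the weighted sum is greater than 0.5 and a non-alarm node if it is less than or equal to 0.5.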
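A minimal sketch of the weighted-sum variant follows. The text does not define how an individual feature is "matched", so the rule used here (a feature scores 1 when the observed value exceeds the non-alarm baseline, else 0) is an assumption, as are the feature names and weights. Only the 0.5 threshold and the "greater than" comparison come from the text.

```python
from typing import Dict


def is_alarm_node(features: Dict[str, float],
                  standard: Dict[str, float],
                  weights: Dict[str, float],
                  threshold: float = 0.5) -> bool:
    """Weighted vote over the features that exceed their non-alarm baseline."""
    score = 0.0
    for name, weight in weights.items():
        # Assumed matching rule: 1 if the observed value is above the baseline.
        mismatch = 1.0 if features[name] > standard[name] else 0.0
        score += weight * mismatch
    # Strictly greater than the threshold means alarm; equal counts as non-alarm.
    return score > threshold


# Baseline of 2 alarms per hour versus one alarm every 5 minutes (12 per hour).
STANDARD = {"alarms_per_hour": 2, "alarm_level": 1}
OBSERVED = {"alarms_per_hour": 12, "alarm_level": 1}
```

With weights of 0.6 for frequency and 0.4 for level, the observed node scores 0.6 and is classified as an alarm node; with equal weights of 0.5 it scores exactly 0.5 and stays a non-alarm node.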
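In the above embodiment, feature data is extracted from the node data and matched against the node standard features to determine the alarm nodes. Whether each node is an alarm node can therefore be determined from its real-time node data, which improves the accuracy of alarm node determination and in turn makes the identification of faulty nodes more accurate.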
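In one embodiment, grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations may include: determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system; taking any alarm node as the starting alarm node and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node and continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.
The node distance is the distance between one alarm node and another. It varies with the number of non-alarm nodes lying between the two alarm nodes.
Specifically, referring to FIG. 3, node A has calling relationships with node B and with node E, node B has a calling relationship with node C, and node C has a calling relationship with node D. Node A, node C, and node E are alarm nodes, while node B and node D are normal nodes. From the calling relationships between the nodes, the server can determine that the node distance between node A and node C is 2 (the non-alarm node B lies between them) and that the node distance between node A and node E is 1.
In this embodiment, the server can set the node threshold distance to 1, that is, set the node threshold used for grouping alarm nodes to 1. The server then takes any alarm node as the starting alarm node, for example node A, and looks for associated alarm nodes whose node distance is less than or equal to 1, finding that the associated alarm node of node A is node E. From the starting alarm node and the associated alarm node, the server obtains the alarm node combination of node A and node E.
In this embodiment, an alarm node combination can also be expressed as an alarm node cluster. Continuing the previous example, the combination of node A and node E can be expressed as the alarm node cluster [A, E].
Still referring to FIG. 3, when the server sets the node threshold distance to 2 and again takes node A as the starting alarm node, it looks for associated alarm nodes whose node distance is less than or equal to 2 and finds that both node E and node C are associated alarm nodes of node A. The server then obtains the alarm node combination of node A, node C, and node E, which can also be expressed as the alarm node cluster [A, C, E].
Alternatively, still referring to FIG. 3, suppose node A, node B, node C, node D, and node E are all alarm nodes and the server sets the node threshold distance to 1. Taking node A as the starting alarm node, the server first determines from the node threshold distance that the associated alarm nodes are node B and node E. The server then takes node B as the starting alarm node and determines that its associated alarm node is node C. Continuing in the same way, the server determines that the associated alarm node of node C is node D, and so obtains the alarm node combination of node A, node B, node C, node D, and node E, which can also be expressed as the alarm node cluster [A, B, C, D, E].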
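The grouping procedure and the three FIG. 3 scenarios can be reproduced with the sketch below. It interprets the node distance as the number of call hops on the shortest path, which matches the distances quoted in the text (A to C is 2, A to E is 1) but is still an interpretation; the function names and the dictionary-of-sets graph layout are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def hop_distances(graph: Dict[str, Set[str]], start: str) -> Dict[str, int]:
    """Breadth-first search: number of call hops from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def group_alarm_nodes(graph: Dict[str, Set[str]],
                      alarm_nodes: Iterable[str],
                      threshold: int) -> List[List[str]]:
    """Grow each combination from a starting alarm node, re-seeding from every associated node."""
    unassigned = set(alarm_nodes)
    groups = []
    for seed in sorted(unassigned):
        if seed not in unassigned:
            continue  # already absorbed into an earlier combination
        unassigned.discard(seed)
        group, frontier = [seed], deque([seed])
        while frontier:
            dist = hop_distances(graph, frontier.popleft())
            for other in sorted(unassigned):
                if other in dist and dist[other] <= threshold:
                    unassigned.discard(other)
                    group.append(other)
                    frontier.append(other)  # the associated node becomes a new starting node
        groups.append(sorted(group))
    return groups


# Call graph of FIG. 3 as described in the text: A-B, A-E, B-C, C-D.
FIG3 = {"A": {"B", "E"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": {"A"}}
```

In the first scenario the sketch also emits node C as a single-node combination, since C is an alarm node that no other alarm node reaches within the threshold; the text only mentions the [A, E] cluster.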
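In the above embodiment, the associated alarm nodes of a starting alarm node are determined from the node distances between alarm nodes and the node threshold distance, and the alarm node combination is generated from them. Associated alarm nodes can therefore be determined accurately from the node distance.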
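In one embodiment, referring to FIG. 4, obtaining the node data of each node in the faulty system may include:
Step S402: obtain the raw alarm data of the faulty system collected by the alarm system.
Raw alarm data is data obtained directly from the alarm system, such as the alarm records kept in the alarm system.
Step S404: extract node alarm data from the raw alarm data to obtain the node alarm data of each node. The node alarm data may include at least one of the alarm type, alarm frequency, and alarm level of each node.
As stated above, node alarm data may include, but is not limited to, alarm type, alarm frequency, and alarm level. In this embodiment, after obtaining the corresponding raw alarm data, the server can extract the node alarm data from it.
Specifically, the server can extract the node alarm data from the raw alarm data according to a preset extraction template. In this embodiment, different types of nodes may have different extraction templates and may yield different node alarm data.
Optionally, after extracting the data corresponding to each node from the raw alarm data, the server can also analyze the extracted data to obtain the node alarm data of each node. For example, after obtaining the alarm data of a host from the alarm records, the server counts the host's alarms to obtain the host's alarm frequency.
Step S406: obtain the node basic data of each node. The node basic data may include at least one of a node type and a node level.
In this embodiment, the server can obtain the node basic data of each node directly from the database of the faulty system.
Step S408: generate the node data of each node according to the node alarm data and the node basic data.
In this embodiment, after obtaining the node alarm data and the node basic data, the server can combine the data belonging to the same node to obtain the node data of each node.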
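Steps S402 to S408 can be sketched as a small merge routine. The record layout (`node`, `type`, `level`) and the output field names are hypothetical; the text only states which kinds of information are involved, and the alarm frequency here is simply the count of records per node within whatever window the raw data covers.

```python
from collections import Counter
from typing import Dict, List


def build_node_data(raw_alarms: List[dict], basic: Dict[str, dict]) -> Dict[str, dict]:
    """Merge per-node alarm statistics (S404) with static node attributes (S406)."""
    counts = Counter(record["node"] for record in raw_alarms)
    node_data = {}
    for node, info in basic.items():
        records = [r for r in raw_alarms if r["node"] == node]
        node_data[node] = {
            "node_type": info["node_type"],
            "node_level": info["node_level"],
            "alarm_frequency": counts[node],  # Counter yields 0 for nodes without alarms
            "alarm_types": sorted({r["type"] for r in records}),
            "max_alarm_level": max((r["level"] for r in records), default=0),
        }
    return node_data


# Hypothetical raw alarm records and node basic data.
RAW = [
    {"node": "host-1", "type": "cpu", "level": 2},
    {"node": "host-1", "type": "disk", "level": 3},
    {"node": "net-1", "type": "latency", "level": 1},
]
BASIC = {
    "host-1": {"node_type": "host", "node_level": 2},
    "net-1": {"node_type": "network", "node_level": 1},
    "host-2": {"node_type": "host", "node_level": 3},
}
NODE_DATA = build_node_data(RAW, BASIC)
```

Nodes that appear only in the basic data still receive an entry, so every node in the graph has a feature record even when it raised no alarm.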
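In the above embodiment, node alarm data is obtained from the raw alarm data, node basic data is obtained, and the node data is then generated from both. The node data therefore covers several aspects of a node, which makes the subsequent alarm node determination more accurate and in turn improves the accuracy of faulty node identification.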
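In one embodiment, determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of whether each alarm node is a faulty node according to each alarm node and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model. The training of the graph convolutional neural network model may include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node; determining the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial model using the loss value; and iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.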
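Specifically, the server can use the graph data and historical node data of various systems as training sample data. The server then labels every node in the training graph data with a labeling tool according to the node training data of each node; for example, LabelImg can be used to label alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes.
The server can then normalize the training graph data to obtain normalized training graph data.
Next, the server inputs the normalized training graph data and the node training data into the constructed initial graph convolutional neural network model, extracts node features through the initial model, and determines the alarm nodes and non-alarm nodes based on the extracted node features.
In this embodiment, the server can quantize the node training data of each node, for example the alarm type and alarm level, to obtain quantized node training data.
The server then performs node prediction according to the determined alarm nodes and non-alarm nodes, the node training data of each node, and the calling relationships between nodes in the training graph data, obtaining a prediction result of whether each node is a faulty node.
In this embodiment, the graph neural network model can predict the probability that each node is a faulty node using the following formula: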
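$$h = \mathrm{softmax}\left(\mathrm{adj} \times \mathrm{ReLU}\left(\mathrm{adj} \times x \times \omega^{(1)}\right) \times \omega^{(2)}\right)$$
Here h holds the probability that each node is a faulty node and is an n*2 matrix, where n is the number of nodes; adj is the normalized training graph data, which can be an n*n adjacency matrix; and x is the node training data, which can be the quantized node training data, for example an n*F matrix where F is the number of data items, each node having F data items.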
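The forward pass of this two-layer formula can be sketched in plain Python as follows. Several points are assumptions rather than statements from the text: the two ω terms are treated as learnable weight matrices of shapes F*H and H*2 for some hidden width H, the softmax is applied per row, and the adjacency normalization is the common symmetric form with self-loops, since the text does not specify which normalization is used.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (n x k) and a (k x m) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Assumed normalization: D^(-1/2) (A + I) D^(-1/2) with self-loops added."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(adj_norm: Matrix, x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """h = softmax(adj * ReLU(adj * x * w1) * w2), returning an n x 2 matrix."""
    hidden = relu(matmul(matmul(adj_norm, x), w1))
    return softmax_rows(matmul(matmul(adj_norm, hidden), w2))


# Tiny hypothetical example: 3 nodes on a path, F = 2 features, hidden width 3.
ADJ_NORM = normalize_adjacency([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W1 = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]
W2 = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.6]]
H = gcn_forward(ADJ_NORM, X, W1, W2)
```

Each row of the result sums to 1, so one column can be read as the fault probability and the other as its complement; which column plays which role depends on the label encoding chosen at training time.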
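The server can then compute the loss value of the initial graph convolutional neural network model with a loss function, based on the prediction results and the labels. For example, the loss value can be computed with a cross-entropy loss function, or with an L1 loss function and/or an L2 loss function; no limitation is imposed here.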
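As an illustration of the cross-entropy option, the sketch below averages the negative log-probability of the labeled class over all nodes. The label encoding (0 for non-faulty, 1 for faulty) and the small clipping constant `eps` are assumptions added for the example.

```python
import math
from typing import List


def cross_entropy(pred: List[List[float]], labels: List[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the labeled class over all nodes."""
    total = 0.0
    for probs, label in zip(pred, labels):
        # Clip to eps so that a predicted probability of 0 does not produce log(0).
        total -= math.log(max(probs[label], eps))
    return total / len(labels)


# Hypothetical predictions for two nodes; assumed encoding: 0 = non-faulty, 1 = faulty.
UNSURE = cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

A model that assigns 0.5 to every class yields a loss of ln 2, and the loss falls toward 0 as the probability of the labeled class approaches 1.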
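The server can then iterate on the initial graph convolutional neural network model according to a preset learning rate and the computed loss value, continually updating the model parameters to obtain the trained graph convolutional neural network model.
In the above embodiment, the trained graph convolutional neural network model is used to determine each alarm node in the faulty system and to obtain the initial detection result of whether each alarm node is a faulty node. This improves the accuracy with which alarm nodes and initial detection results are determined, and in turn the accuracy of faulty node determination.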
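In one embodiment, at least one of the graph data and the node data is uploaded to a blockchain and stored in the nodes of the blockchain.
A blockchain is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
Specifically, a blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
In this embodiment, the server can upload one or more of the graph data and the node data and store them in the nodes of the blockchain to ensure the privacy and security of the data.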
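The idea of data blocks linked by cryptographic methods can be illustrated with a toy hash chain. This is not a blockchain platform and not the storage mechanism of the application: it has no consensus, no peer-to-peer network, and no encryption of the payload, and the function names and payload contents are assumptions. It only shows why altering stored graph data or node data becomes detectable once blocks are hash-linked.

```python
import hashlib
import json
from typing import List


def make_block(payload: dict, prev_hash: str) -> dict:
    """Create a block whose hash covers both its payload and the previous block's hash."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    return {"payload": payload, "prev_hash": prev_hash,
            "hash": hashlib.sha256(body.encode("utf-8")).hexdigest()}


def verify_chain(chain: List[dict]) -> bool:
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = "0" * 64  # conventional all-zero hash for the first block
    for block in chain:
        expected = make_block(block["payload"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


# Store hypothetical graph data, then node data, as two linked blocks.
CHAIN = [make_block({"graph": [[0, 1], [1, 0]]}, "0" * 64)]
CHAIN.append(make_block({"node_data": {"A": {"alarm_frequency": 3}}}, CHAIN[-1]["hash"]))
```

Changing any stored value after the fact makes `verify_chain` return `False`, because the recomputed hash no longer matches the recorded one.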
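In the above embodiment, at least one of the graph data and the node data is uploaded to the blockchain and stored in its nodes, which protects the privacy of the data stored in the blockchain nodes and improves data security.
It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 4 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed sequentially, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.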
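In one embodiment, as shown in FIG. 5, an artificial intelligence-based fault node identification apparatus is provided, including a graph data acquisition module 100, a node data acquisition module 200, an alarm node and initial detection result generation module 300, an alarm node combination determination module 400, and a faulty node determination module 500, where:
The graph data acquisition module 100 is configured to obtain graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes.
The node data acquisition module 200 is configured to obtain node data of each node in the faulty system.
The alarm node and initial detection result generation module 300 is configured to determine each alarm node in the faulty system according to the node data of each node, and to obtain, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node.
The alarm node combination determination module 400 is configured to group the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations.
The faulty node determination module 500 is configured to determine the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.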
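How the five modules hand data to one another can be sketched as a small pipeline object. The class name, the callable signatures, and the choice to represent the initial detection result as a mapping from alarm node to probability are all assumptions made for illustration; the text describes the modules functionally and does not prescribe an interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FaultNodeIdentifier:
    get_graph: Callable[[str], dict]                                   # module 100
    get_node_data: Callable[[str], dict]                               # module 200
    detect: Callable[[dict, dict], Dict[str, float]]                   # module 300
    group: Callable[[dict, List[str]], List[List[str]]]                # module 400
    decide: Callable[[List[List[str]], Dict[str, float]], List[str]]   # module 500

    def run(self, system_id: str) -> List[str]:
        graph = self.get_graph(system_id)
        node_data = self.get_node_data(system_id)
        initial = self.detect(graph, node_data)      # alarm node -> fault probability
        groups = self.group(graph, list(initial))    # alarm node combinations
        return self.decide(groups, initial)          # one faulty node per combination


# Stub modules standing in for real implementations.
IDENTIFIER = FaultNodeIdentifier(
    get_graph=lambda system_id: {"A": {"E"}, "E": {"A"}},
    get_node_data=lambda system_id: {},
    detect=lambda graph, data: {"A": 0.9, "E": 0.2},
    group=lambda graph, alarms: [sorted(alarms)],
    decide=lambda groups, probs: [max(g, key=probs.get) for g in groups],
)
```

Keeping each module behind a callable makes it straightforward to swap, for example, a rule-based detector for a trained graph convolutional model without touching the rest of the pipeline.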
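In one embodiment, the alarm node and initial detection result generation module 300 may include:
an extraction sub-module, configured to extract feature data from the node data to obtain the node features corresponding to each node;
a node standard feature determination sub-module, configured to determine the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and
a matching sub-module, configured to match the node standard features of each node against its node features to obtain each alarm node in the faulty system.
In one embodiment, the alarm node combination determination module 400 may include:
a node distance determination sub-module, configured to determine the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system;
an associated fault node determination sub-module, configured to take any alarm node as the starting alarm node and determine the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance;
a loop sub-module, configured to take each associated alarm node as a new starting alarm node and continue to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and
an alarm node combination determination sub-module, configured to place the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.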
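In one embodiment, the node data acquisition module 200 may include:
a raw alarm data acquisition sub-module, configured to obtain the raw alarm data of the faulty system collected by the alarm system;
a node alarm data generation sub-module, configured to extract node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node;
a node basic data acquisition sub-module, configured to obtain the node basic data of each node, the node basic data including at least one of a node type and a node level; and
a node data generation sub-module, configured to generate the node data of each node according to the node alarm data and the node basic data.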
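In one embodiment, the alarm node and initial detection result generation module 300 may use a pre-trained graph convolutional neural network model both to determine each alarm node in the faulty system according to the node data of each node and to obtain, according to each alarm node and the calling relationships between the multiple nodes, the initial detection result of whether each alarm node in the faulty system is a faulty node.
In this embodiment, the above apparatus may further include a model training module for training the graph convolutional neural network model.
In this embodiment, the model training module may include:
a training sample data acquisition sub-module, configured to obtain training sample data, the training sample data including training graph data and node training data of each node;
a labeling sub-module, configured to label each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes;
a feature extraction sub-module, configured to input the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and to perform feature extraction on the training sample data through the initial model to obtain feature data;
a regression prediction sub-module, configured to perform regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node;
a loss calculation sub-module, configured to determine the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and to update the model parameters of the initial model using the loss value; and
an iterative processing sub-module, configured to iterate on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.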
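In one embodiment, the above apparatus may further include:
an upload and storage module, configured to upload at least one of the graph data and the node data to a blockchain and store it in the nodes of the blockchain.
For the specific limitations of the artificial intelligence-based fault node identification apparatus, refer to the limitations of the artificial intelligence-based fault node identification method above, which are not repeated here. Each module in the above apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call them and execute the operations corresponding to each module.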
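In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile or volatile storage medium and an internal memory. The non-volatile or volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions held in the non-volatile storage medium. The database of the computer device stores graph data, node data, and other data. The network interface of the computer device communicates with an external terminal through a network connection. When executed by the processor, the computer-readable instructions implement an artificial intelligence-based fault node identification method.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, may combine some components, or may have a different arrangement of components.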
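A computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the processors, cause the one or more processors to perform the following steps: obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes; obtaining node data of each node in the faulty system; determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node; grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.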
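In one embodiment, determining each alarm node in the faulty system according to the node data of each node, as implemented when the processor executes the computer-readable instructions, may include: extracting feature data from the node data to obtain the node features corresponding to each node; determining the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and matching the node standard features of each node against its node features to obtain each alarm node in the faulty system.
In one embodiment, grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations, as implemented when the processor executes the computer-readable instructions, may include: determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system; taking any alarm node as the starting alarm node and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node and continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.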
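In one embodiment, obtaining the node data of each node in the faulty system, as implemented when the processor executes the computer-readable instructions, may include: obtaining the raw alarm data of the faulty system collected by the alarm system; extracting node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node; obtaining the node basic data of each node, the node basic data including at least one of a node type and a node level; and generating the node data of each node according to the node alarm data and the node basic data.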
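In one embodiment, when the processor executes the computer-readable instructions, determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of whether each alarm node is a faulty node according to each alarm node and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model. The training of the graph convolutional neural network model may include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node; determining the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial model using the loss value; and iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.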
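In one embodiment, the processor further implements the following step when executing the computer-readable instructions: uploading at least one of the graph data and the node data to a blockchain and storing it in the nodes of the blockchain.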
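One or more computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes; obtaining node data of each node in the faulty system; determining each alarm node in the faulty system according to the node data of each node, and obtaining, according to each alarm node and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node; grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and determining the faulty node in each alarm node combination according to each alarm node combination and the initial detection result of each alarm node.
The computer-readable storage medium may be non-volatile or volatile.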
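In one embodiment, determining each alarm node in the faulty system according to the node data of each node, as implemented when the computer-readable instructions are executed by the processor, may include: extracting feature data from the node data to obtain the node features corresponding to each node; determining the node standard features of each node, where the node standard features are features extracted from the node data of the node in a non-alarm state; and matching the node standard features of each node against its node features to obtain each alarm node in the faulty system.
In one embodiment, grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations, as implemented when the computer-readable instructions are executed by the processor, may include: determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system; taking any alarm node as the starting alarm node and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to the node threshold distance; taking each associated alarm node as a new starting alarm node and continuing to determine the associated alarm nodes whose node distance from it is less than or equal to the node threshold distance; and placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.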
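In one embodiment, obtaining the node data of each node in the faulty system, as implemented when the computer-readable instructions are executed by the processor, may include: obtaining the raw alarm data of the faulty system collected by the alarm system; extracting node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node; obtaining the node basic data of each node, the node basic data including at least one of a node type and a node level; and generating the node data of each node according to the node alarm data and the node basic data.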
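In one embodiment, when the computer-readable instructions are executed by the processor, determining each alarm node in the faulty system according to the node data of each node, and obtaining the initial detection result of whether each alarm node is a faulty node according to each alarm node and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model. The training of the graph convolutional neural network model may include: obtaining training sample data, which includes training graph data and node training data of each node; labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes; inputting the labeled training graph data and the training sample data into the constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial model to obtain feature data; performing regression prediction on the feature data to obtain a prediction result of whether each node is a faulty node or a non-faulty node; determining the loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial model using the loss value; and iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.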
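In one embodiment, the following step is further implemented when the computer-readable instructions are executed by the processor: uploading at least one of the graph data and the node data to a blockchain and storing it in the nodes of the blockchain.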
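A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).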
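The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. The protection scope of this patent application is therefore defined by the appended claims.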

Claims (20)

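  1. An artificial intelligence-based fault node identification method, comprising:
    obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
    obtaining node data of each node in the faulty system;
    determining each alarm node in the faulty system according to the node data of each of the nodes, and obtaining, according to each of the alarm nodes and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
    grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
    determining the faulty node in each of the alarm node combinations according to each of the alarm node combinations and the initial detection result of each alarm node.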
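  2. The method according to claim 1, wherein determining each alarm node in the faulty system according to the node data of each of the nodes comprises:
    extracting feature data from each item of the node data to obtain the node features corresponding to each of the nodes;
    determining the node standard features of each of the nodes, wherein the node standard features are features extracted from the node data of the node in a non-alarm state; and
    matching the node standard features of each of the nodes against its node features to obtain each alarm node in the faulty system.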
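  3. The method according to claim 1, wherein grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain the alarm node combinations comprises:
    determining the node distance between any two alarm nodes according to the calling relationships between the multiple nodes in the faulty system;
    taking any alarm node as a starting alarm node, and determining the associated alarm nodes whose node distance from the starting alarm node is less than or equal to a node threshold distance;
    taking the associated alarm node as a starting alarm node, and continuing to determine the associated alarm nodes whose node distance from the associated alarm node is less than or equal to the node threshold distance; and
    placing the starting alarm node and its corresponding associated alarm nodes in the same alarm node combination.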
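  4. The method according to claim 1, wherein obtaining the node data of each node in the faulty system comprises:
    obtaining raw alarm data of the faulty system collected by an alarm system;
    extracting node alarm data from the raw alarm data to obtain the node alarm data of each node, the node alarm data including at least one of the alarm type, alarm frequency, and alarm level of each node;
    obtaining node basic data of each node, the node basic data including at least one of a node type and a node level; and
    generating the node data of each node according to the node alarm data and the node basic data.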
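  5. The method according to claim 1, wherein determining each alarm node in the faulty system according to the node data of each of the nodes, and obtaining the initial detection result of whether each alarm node in the faulty system is a faulty node according to each of the alarm nodes and the calling relationships between the multiple nodes, are performed by a pre-trained graph convolutional neural network model, and the training of the graph convolutional neural network model comprises:
    obtaining training sample data, the training sample data including training graph data and node training data of each node;
    labeling each node in the training graph data to obtain training graph data in which the nodes are labeled as alarm nodes, non-alarm nodes, faulty nodes, and non-faulty nodes;
    inputting the labeled training graph data and the training sample data into a constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial graph convolutional neural network model to obtain feature data;
    performing regression prediction on the feature data to obtain a prediction result of whether each of the nodes is a faulty node or a non-faulty node;
    determining a loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating the model parameters of the initial graph convolutional neural network model using the loss value; and
    iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.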
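  6. The method according to any one of claims 1 to 5, wherein the method further comprises:
    uploading at least one of the graph data and the node data to a blockchain and storing it in the nodes of the blockchain.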
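  7. An artificial intelligence-based fault node identification apparatus, comprising:
    a graph data acquisition module, configured to obtain graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
    a node data acquisition module, configured to obtain node data of each node in the faulty system;
    an alarm node and initial detection result generation module, configured to determine each alarm node in the faulty system according to the node data of each of the nodes, and to obtain, according to each of the alarm nodes and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
    an alarm node combination determination module, configured to group the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
    a faulty node determination module, configured to determine the faulty node in each of the alarm node combinations according to each of the alarm node combinations and the initial detection result of each alarm node.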
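  8. The apparatus according to claim 7, wherein the alarm node and initial detection result generation module comprises:
    an extraction sub-module, configured to extract feature data from each item of the node data to obtain the node features corresponding to each of the nodes;
    a node standard feature determination sub-module, configured to determine the node standard features of each of the nodes, wherein the node standard features are features extracted from the node data of the node in a non-alarm state; and
    a matching sub-module, configured to match the node standard features of each of the nodes against its node features to obtain each alarm node in the faulty system.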
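  9. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining graph data corresponding to a faulty system, the graph data including multiple nodes in the faulty system and the calling relationships between the multiple nodes;
    obtaining node data of each node in the faulty system;
    determining each alarm node in the faulty system according to the node data of each of the nodes, and obtaining, according to each of the alarm nodes and the calling relationships between the multiple nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
    grouping the alarm nodes according to the calling relationships between the multiple nodes to obtain alarm node combinations; and
    determining the faulty node in each of the alarm node combinations according to each of the alarm node combinations and the initial detection result of each alarm node.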
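  10. The computer device according to claim 9, wherein determining each alarm node in the faulty system according to the node data of each of the nodes, as implemented when the processor executes the computer-readable instructions, comprises:
    extracting feature data from each item of the node data to obtain the node features corresponding to each of the nodes;
    determining the node standard features of each of the nodes, wherein the node standard features are features extracted from the node data of the node in a non-alarm state; and
    matching the node standard features of each of the nodes against its node features to obtain each alarm node in the faulty system.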
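  11. The computer device according to claim 9, wherein the grouping of the alarm nodes according to the call relationships among the plurality of nodes to obtain alarm node groups, as implemented when the processor executes the computer-readable instructions, comprises:
    determining a node distance between any two alarm nodes according to the call relationships among the plurality of nodes in the faulty system;
    taking any alarm node as a starting alarm node, and determining associated alarm nodes whose node distance from the starting alarm node is less than or equal to a node threshold distance;
    taking each associated alarm node as a starting alarm node, and continuing to determine associated alarm nodes whose node distance from that associated alarm node is less than or equal to the node threshold distance; and
    assigning the starting alarm node and its corresponding associated alarm nodes to the same alarm node group.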
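The grouping steps above amount to a transitive expansion over alarm nodes that lie within the threshold distance of one another. The sketch below assumes that the node distance is the number of hops in the call graph with call direction ignored; the service names, the edge list, and both function names are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def group_alarm_nodes(edges, alarm_nodes, threshold):
    """Group alarm nodes by repeatedly absorbing alarm nodes within threshold."""
    adjacency = {}
    for caller, callee in edges:
        adjacency.setdefault(caller, set()).add(callee)
        adjacency.setdefault(callee, set()).add(caller)  # direction ignored (assumption)
    dist = {node: hop_distances(adjacency, node) for node in alarm_nodes}

    groups, seen = [], set()
    for start in alarm_nodes:
        if start in seen:
            continue
        seen.add(start)
        group, frontier = [], [start]
        while frontier:
            current = frontier.pop()
            group.append(current)
            # Each associated alarm node becomes a new starting alarm node.
            for other in alarm_nodes:
                if other not in seen and dist[current].get(other, float("inf")) <= threshold:
                    seen.add(other)
                    frontier.append(other)
        groups.append(sorted(group))
    return groups


edges = [
    ("gateway", "order"), ("order", "payment"), ("payment", "db"),
    ("gateway", "search"), ("search", "index"), ("index", "disk"),
]
alarm_nodes = ["order", "payment", "db", "disk"]
groups = group_alarm_nodes(edges, alarm_nodes, threshold=1)
```

With a threshold of 1, `order` and `db` are two hops apart yet end up in the same group because `payment` links them, which is the effect of continuing the search from each associated alarm node.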
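  12. The computer device according to claim 9, wherein the acquiring of the node data of each node in the faulty system, as implemented when the processor executes the computer-readable instructions, comprises:
    acquiring raw alarm data of the faulty system collected by an alarm system;
    extracting node alarm data from the raw alarm data to obtain node alarm data of each node, the node alarm data comprising at least one of an alarm type, an alarm frequency, and an alarm level of each node;
    acquiring node base data of each node, the node base data comprising at least one of a node type and a node level; and
    generating the node data of each node according to the node alarm data and the node base data.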
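  13. The computer device according to claim 9, wherein the determining of the alarm nodes in the faulty system according to the node data of each node, and the obtaining, according to the alarm nodes and the call relationships among the plurality of nodes, of an initial detection result of whether each alarm node in the faulty system is a faulty node, as implemented when the processor executes the computer-readable instructions, are performed by a pre-trained graph convolutional neural network model, and the graph convolutional neural network model is trained by:
    acquiring training sample data, the training sample data comprising training graph data and node training data of each node;
    labeling each node in the training graph data to obtain training graph data labeled to indicate whether each node is an alarm node, a non-alarm node, a faulty node, or a non-faulty node;
    inputting the labeled training graph data and the training sample data into a constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial graph convolutional neural network model to obtain feature data;
    performing regression prediction on the feature data to obtain a prediction result of each node being a faulty node or a non-faulty node;
    determining a loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating model parameters of the initial graph convolutional neural network model by means of the loss value; and
    iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.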
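  14. The computer device according to any one of claims 9 to 13, wherein the processor, when executing the computer-readable instructions, further implements the following step:
    uploading at least one of the graph data and the node data to a blockchain, and storing it in a node of the blockchain.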
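  15. One or more computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring graph data corresponding to a faulty system, the graph data comprising a plurality of nodes in the faulty system and call relationships among the plurality of nodes;
    acquiring node data of each node in the faulty system;
    determining alarm nodes in the faulty system according to the node data of each node, and obtaining, according to the alarm nodes and the call relationships among the plurality of nodes, an initial detection result of whether each alarm node in the faulty system is a faulty node;
    grouping the alarm nodes according to the call relationships among the plurality of nodes to obtain alarm node groups; and
    determining the faulty node in each alarm node group according to the alarm node groups and the initial detection result of each alarm node.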
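  16. The storage medium according to claim 15, wherein the determining of the alarm nodes in the faulty system according to the node data of each node, as implemented when the computer-readable instructions are executed by the processor, comprises:
    extracting feature data from the node data to obtain a node feature corresponding to each node;
    determining a node standard feature of each node, wherein the node standard feature is a feature extracted from node data of the node in a non-alarm state; and
    matching the node standard feature of each node against its node feature to obtain the alarm nodes in the faulty system.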
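  17. The storage medium according to claim 15, wherein the grouping of the alarm nodes according to the call relationships among the plurality of nodes to obtain alarm node groups, as implemented when the computer-readable instructions are executed by the processor, comprises:
    determining a node distance between any two alarm nodes according to the call relationships among the plurality of nodes in the faulty system;
    taking any alarm node as a starting alarm node, and determining associated alarm nodes whose node distance from the starting alarm node is less than or equal to a node threshold distance;
    taking each associated alarm node as a starting alarm node, and continuing to determine associated alarm nodes whose node distance from that associated alarm node is less than or equal to the node threshold distance; and
    assigning the starting alarm node and its corresponding associated alarm nodes to the same alarm node group.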
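  18. The storage medium according to claim 15, wherein the acquiring of the node data of each node in the faulty system, as implemented when the computer-readable instructions are executed by the processor, comprises:
    acquiring raw alarm data of the faulty system collected by an alarm system;
    extracting node alarm data from the raw alarm data to obtain node alarm data of each node, the node alarm data comprising at least one of an alarm type, an alarm frequency, and an alarm level of each node;
    acquiring node base data of each node, the node base data comprising at least one of a node type and a node level; and
    generating the node data of each node according to the node alarm data and the node base data.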
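  19. The storage medium according to claim 15, wherein the determining of the alarm nodes in the faulty system according to the node data of each node, and the obtaining, according to the alarm nodes and the call relationships among the plurality of nodes, of an initial detection result of whether each alarm node in the faulty system is a faulty node, as implemented when the computer-readable instructions are executed by the processor, are performed by a pre-trained graph convolutional neural network model, and the graph convolutional neural network model is trained by:
    acquiring training sample data, the training sample data comprising training graph data and node training data of each node;
    labeling each node in the training graph data to obtain training graph data labeled to indicate whether each node is an alarm node, a non-alarm node, a faulty node, or a non-faulty node;
    inputting the labeled training graph data and the training sample data into a constructed initial graph convolutional neural network model, and performing feature extraction on the training sample data through the initial graph convolutional neural network model to obtain feature data;
    performing regression prediction on the feature data to obtain a prediction result of each node being a faulty node or a non-faulty node;
    determining a loss value of the initial graph convolutional neural network model based on the prediction result and the labeled training graph data, and updating model parameters of the initial graph convolutional neural network model by means of the loss value; and
    iterating on the initial graph convolutional neural network model to obtain the trained graph convolutional neural network model.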
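  20. The storage medium according to any one of claims 15 to 19, wherein the computer-readable instructions, when executed by the processor, further implement the following step:
    uploading at least one of the graph data and the node data to a blockchain, and storing it in a node of the blockchain.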
PCT/CN2020/098772 2020-06-09 2020-06-29 Artificial-intelligence-based faulty node identification method, apparatus, device, and medium WO2021114613A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010517479.1 2020-06-09
CN202010517479.1A CN111679953B (zh) 2020-06-09 2020-06-09 Artificial-intelligence-based faulty node identification method, apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2021114613A1 (zh) 2021-06-17

Family

ID=72454134

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098772 WO2021114613A1 (zh) 2020-06-09 2020-06-29 Artificial-intelligence-based faulty node identification method, apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN111679953B (zh)
WO (1) WO2021114613A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968947A (zh) * 2022-03-01 2022-08-30 华为技术有限公司 Fault file storage method and related apparatus
US20230229781A1 (en) * 2022-01-14 2023-07-20 Lenovo (Singapore) Pte. Ltd. Predicting system misconfigurations using machine learning

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434193B (zh) * 2020-10-27 2023-09-29 北京空间飞行器总体设计部 Guided method and apparatus for rapid system fault troubleshooting

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
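CN107222339A (zh) * 2017-05-27 2017-09-29 全球能源互联网研究院 Fault analysis method and apparatus for a power information and communication system based on a graph database
CN109740025A (zh) * 2019-01-11 2019-05-10 中电福富信息科技有限公司 Fault impact analysis method based on a graph data model
CN109756376A (zh) * 2019-01-11 2019-05-14 中电福富信息科技有限公司 Alarm correlation analysis method based on a graph data model
CN110134539A (zh) * 2019-05-14 2019-08-16 极智(上海)企业管理咨询有限公司 Method for diagnosing the root cause of distributed system faults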
US20200021607A1 (en) * 2015-08-31 2020-01-16 Splunk Inc. Detecting Anomalies in a Computer Network Based on Usage Similarity Scores

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR464601A0 (en) * 2001-04-30 2001-05-24 Commonwealth Of Australia, The Shapes vector
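CN104579717B (zh) * 2013-10-09 2018-02-23 中国移动通信集团江苏有限公司 DCN fault locating method and apparatus
CN108322351B (zh) * 2018-03-05 2021-09-10 北京奇艺世纪科技有限公司 Method and apparatus for generating a topology graph, and fault determination method and apparatus
CN108845912B (zh) * 2018-06-11 2019-08-06 掌阅科技股份有限公司 Alarm method for service interface call faults, and computing device
CN111193605B (zh) * 2019-08-28 2022-02-01 腾讯科技(深圳)有限公司 Fault locating method, apparatus, and storage medium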

Also Published As

Publication number Publication date
CN111679953A (zh) 2020-09-18
CN111679953B (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
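CN109032829B (zh) Data anomaly detection method, apparatus, computer device, and storage medium
WO2021114613A1 (zh) Artificial-intelligence-based faulty node identification method, apparatus, device, and medium
WO2021042843A1 (zh) Alarm information decision method, apparatus, computer device, and storage medium
CN111475804B (zh) Alarm prediction method and system
WO2021139252A1 (zh) Operation and maintenance fault root cause identification method, apparatus, computer device, and storage medium
WO2021135499A1 (zh) Damage detection model training and vehicle damage detection methods, apparatus, device, and medium
CN111767707B (zh) Identical medical case detection method, apparatus, device, and storage medium
CN110675959B (zh) Intelligent data analysis method, apparatus, computer device, and storage medium
WO2019232964A1 (zh) Risk management data processing method, apparatus, computer device, and storage medium
CN111475370A (zh) Data-center-based operation and maintenance monitoring method, apparatus, device, and storage medium
WO2022095434A1 (zh) Autoencoder-based data anomaly identification method, apparatus, and computer device
WO2022252454A1 (zh) Abnormal data detection method, apparatus, computer device, and readable storage medium
WO2020073727A1 (zh) Risk prediction method, apparatus, computer device, and storage medium
CN111400126B (zh) Network service abnormal data detection method, apparatus, device, and medium
CN113190426B (zh) Stability monitoring method for a big data scoring system
CN113705685A (zh) Disease feature recognition model training and disease feature recognition methods, apparatus, and device
WO2021103401A1 (zh) Data object classification method, apparatus, computer device, and storage medium
CN113707296B (zh) Medical plan data processing method, apparatus, device, and storage medium
CN113110961B (zh) Device anomaly detection method, apparatus, computer device, and readable storage medium
CN110166422A (zh) Domain name behavior identification method, apparatus, readable storage medium, and computer device
CN111178407B (zh) Road condition data screening method, apparatus, computer device, and storage medium
CN117118693A (zh) Abnormal traffic detection method, apparatus, computer device, and storage medium
CN116776150A (zh) Abnormal interface access identification method, apparatus, computer device, and storage medium
CN110874612B (zh) Time period prediction method, apparatus, computer device, and storage medium
CN113627514A (zh) Knowledge graph data processing method, apparatus, electronic device, and storage medium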

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899700

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899700

Country of ref document: EP

Kind code of ref document: A1