CN116980279A - Fault diagnosis system and fault diagnosis method for programmable network element equipment - Google Patents

Publication number
CN116980279A
Authority
CN
China
Prior art keywords
network element
fault
programmable network
element equipment
diagnosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311238378.0A
Other languages
Chinese (zh)
Other versions
CN116980279B (en)
Inventor
薛镭
高万鑫
肖戈扬
朱俊
邹涛
张汝云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311238378.0A
Publication of CN116980279A
Application granted
Publication of CN116980279B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses a fault diagnosis system and a fault diagnosis method for programmable network element devices. The system comprises: a data collector, which stores the globally consistent flow table on the programmable network element device as a file for backup, collects the dependency relationships and cooperative operation logic between the working tasks of the programmable network element device, and uploads them to a diagnostic decision maker; a network measurer, which periodically measures the running state of the programmable network element device and reports it to the diagnostic decision maker; the diagnostic decision maker, which takes the data uploaded by the data collector as the input of a deep neural network, trains a fault diagnosis model, diagnoses and makes decisions on the running state data uploaded by the network measurer based on the fault diagnosis model, and issues processing rule information to a fault processor; and the fault processor, which remotely controls any programmable network element device diagnosed as being in a fault state. The application enables online fault diagnosis.

Description

Fault diagnosis system and fault diagnosis method for programmable network element equipment
Technical Field
The present application relates to the field of computer networks, and in particular, to a fault diagnosis system and a fault diagnosis method for a programmable network element device.
Background
In a programmable network, programmable network element devices serve as the data plane and are configured with the network programs delivered by users. These devices are typically highly flexible and programmable so that they can accommodate different network requirements. At present, programmable network element devices across the whole network are inspected in a manual, semi-automatic manner, which includes analyzing device logs and state information with online log tools and fault analysis tools, performing manual on-site physical inspection, and so on. Because a failed programmable network element device currently has to be diagnosed and maintained manually, the following problems arise:
(1) High maintenance cost: a programmable network element device in a fault state requires human intervention for diagnosis and repair. This requires skilled personnel and their time, resulting in high maintenance costs.
(2) Long maintenance period: the underlying resources do not support packet processing logic in the network program that has temporarily not been triggered, so in certain situations packets may fail to be processed correctly or may be discarded. Solving this requires detailed manual fault diagnosis to determine the specific cause of the device fault, which lengthens the maintenance period.
Disclosure of Invention
To address the shortcomings of the prior art, the application provides a fault diagnosis system and a fault diagnosis method for programmable network element devices, with the aims of improving the operation and maintenance efficiency and reliability of programmable network element devices and reducing operation and maintenance costs.
The aim of the application is achieved by the following technical solution:
a fault diagnosis system for programmable network element devices comprises a data collector, a network measurer, a diagnostic decision maker and a fault processor;
the data collector stores the globally consistent flow table on the programmable network element device as a file for backup, collects the dependency relationships and cooperative operation logic between the working tasks of the programmable network element device, and uploads them to the diagnostic decision maker;
the network measurer periodically measures the running state of the programmable network element device and reports it to the diagnostic decision maker;
the diagnostic decision maker takes the data uploaded by the data collector as the input of a deep neural network, trains a fault diagnosis model, diagnoses and makes decisions on the running state data uploaded by the network measurer based on the fault diagnosis model, and issues processing rule information to the fault processor;
and the fault processor remotely controls the programmable network element device diagnosed as being in a fault state: it hot-restarts the system, initializes the programmable network element device with the flow table backed up by the data collector, and reconnects the device to the network.
Further, the data collector builds and labels a data set from the working state data of the programmable network element device in conventional usage scenarios, explicitly marking the task information and labeling the diagnostic elements in those scenarios.
Further, the conventional scenarios comprise normal operation scenarios and constructed fault scenarios, and the diagnostic elements are labeled by means including semantic recognition, entity recognition and data cleaning.
Further, the diagnostic decision maker comprises a communication module, a diagnostic algorithm module and a strategy generation module;
the communication module, on the one hand, communicates with the distributed data collectors to obtain the dependency relationships between the working tasks of the programmable network element device, the cooperative operation logic and the identifiers of the communication stages; on the other hand, it communicates with the fault processor, issues the fault processing rules generated by the strategy generation module, and obtains the measurement data of the programmable network element device after the fault processor has processed it;
the diagnostic algorithm module reads the dependency relationships between the cooperative tasks of the programmable network element device and the cooperative task identifiers, computes the start priority of each cooperative task using a topological sorting algorithm, and outputs the priorities to the strategy generation module; when the dependency relationships of the cooperative tasks are topologically sorted, cooperative tasks that start at the same time are marked with the same priority;
the strategy generation module formulates specific processing rules according to the functional limits of the working tasks and the dependency-aware working task priorities produced by the diagnostic algorithm module, where the functional limits include the limit on the number of working task priority channels supported by the programmable network element device.
Further, the network measurer monitors key indicators and performance parameters and thereby feeds real-time device state information back to the diagnostic decision maker.
Further, the diagnostic algorithm module takes as input the large volume of data collected by the data collectors together with feedback data on the running condition of the programmable network element devices, builds an AI model of the working task dependency relationships and cooperative operation logic through deep neural network training and inference, iterates its model parameters under a deep reinforcement learning framework through continuous interaction with real network environment information, and adjusts its parameters according to the reward derived from its output, thereby improving the accuracy and effectiveness of the model.
Further, the strategy generation module generates processing rule information according to the functional limits of the working tasks and the dependency-aware cooperative task priorities output by the diagnostic algorithm module, and sends the processing rule information to the fault processor through the communication module.
Further, after receiving the device state information sent by the network measurer, the communication module adds it to the message receiving queue of the diagnostic algorithm module, and adds the processing rule information sent by the strategy generation module to the message sending queue.
Further, after receiving the strategy message sent by the communication module, the fault processor parses it, extracts the location information of the programmable network element device, and performs remote control.
The fault diagnosis method for programmable network element devices is implemented on the basis of the fault diagnosis system described above, and specifically comprises the following steps:
the data collector stores the globally consistent flow table on the programmable network element device as a file for backup, collects the dependency relationships and cooperative operation logic between the working tasks of the programmable network element device, and uploads them to the diagnostic decision maker;
the network measurer periodically measures the running state of the programmable network element device and reports it to the diagnostic decision maker;
the diagnostic decision maker takes the data uploaded by the data collector as the input of a deep neural network, trains a fault diagnosis model, diagnoses and makes decisions on the running state uploaded by the network measurer based on the fault diagnosis model, and issues processing rule information to the fault processor;
and the fault processor remotely controls the programmable network element device diagnosed as being in a fault state: it hot-restarts the system, initializes the programmable network element device with the flow table backed up by the data collector, and reconnects the device to the network.
The beneficial effects of the application are as follows:
Through the data collector and the network measurer, the fault diagnosis system monitors the working state of programmable network element devices in real time while they are running; through the diagnostic decision maker and the fault processor, it performs online, artificial-intelligence-based diagnosis and fault handling when a programmable network element device fails. With this fault diagnosis system and method, online fault diagnosis of programmable network element devices can be achieved using artificial intelligence technology, improving the reliability and stability of the whole network.
Drawings
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 is a schematic diagram of a fault diagnosis system according to an embodiment of the application.
Fig. 2 is a schematic diagram of a data collector according to an embodiment of the application.
Fig. 3 is a schematic diagram of a network measurer according to an embodiment of the application.
Fig. 4 is a schematic diagram of a diagnostic flow of a diagnostic decision maker according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a fault handler according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon" or "in response to a determination", depending on the context.
Next, a fault diagnosis system and a fault diagnosis method for the programmable network element device disclosed in the embodiment of the present application are described.
Fig. 1 is a schematic diagram of a fault diagnosis system for programmable network element devices according to the present application. The system comprises a data collector, a network measurer, a diagnostic decision maker and a fault processor.
1. Data collector
The data collector stores the globally consistent flow table on the programmable network element device as a file for backup, collects the dependency relationships and cooperative operation logic between the working tasks of the programmable network element device, and uploads them to the diagnostic decision maker.
In this embodiment, Fig. 2 is a schematic diagram of the data collector. The data collector periodically acquires the globally consistent flow table data on the programmable network element device and backs it up as a file; either local storage or cloud storage can be chosen as the storage medium. The data set is built and labeled according to the following steps:
1. Determine the conventional scenarios, which comprise normal operation scenarios and constructed fault scenarios. Build and label a data set for these scenarios, which includes collecting the relevant working task information in each scenario.
2. Label the diagnostic elements in the conventional scenarios, using semantic recognition, entity recognition, data cleaning and the like. The data set is labeled automatically or manually so as to ensure its accuracy and consistency.
3. Randomly split the data set into a training set and a test set. The training set is used to train the deep neural network model and the test set is used to evaluate its performance; techniques such as cross-validation can further verify the robustness and generalization ability of the model.
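Step 3 can be illustrated with a minimal sketch. The function names `split_dataset` and `k_folds`, the 80/20 ratio, and the record layout are assumptions made for illustration only; the source does not specify them.

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=0):
    """Randomly split labeled samples into (training set, test set)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def k_folds(samples, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    for i in range(k):
        validation = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, validation

# Hypothetical labeled records: one per collected working-task sample.
records = [{"task_id": i, "label": "fault" if i % 4 == 0 else "normal"}
           for i in range(10)]
train_set, test_set = split_dataset(records, test_ratio=0.2)
folds = list(k_folds(train_set, 4))
```

Cross-validation is run only on the training portion here, so the test set stays untouched until the final evaluation.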
A communication channel is established over a TCP connection between the data collector and the communication module of the diagnostic decision maker, enabling data transmission. Once the connection is established, the data collector starts a process that collects the working task dependency and cooperative operation logic data of the programmable network element devices. These data are then uploaded to the diagnostic decision maker through the API provided by the communication module for subsequent analysis and decision making.
The data collector needs to collect comprehensive data on the configuration and running state of the programmable network element device in order to obtain the dependency and cooperative logic relationships of the various working tasks. The collected data includes, but is not limited to, flow table rules, forwarding paths, service chain topology and network slice information. In addition, information such as the software and hardware versions and resource usage of the network elements needs to be collected to support comprehensive fault diagnosis and decision optimization by the diagnostic decision maker. To ensure that data collection is timely and accurate, mechanisms such as incremental collection and heartbeat detection are needed for data synchronization, and fault recovery and retransmission mechanisms should be provided to improve stability.
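Incremental collection can be sketched as computing the difference between two flow table snapshots and uploading only that difference. The dictionary representation of a flow table (match key mapped to action) and the name `flow_table_delta` are hypothetical; the source does not describe the snapshot format.

```python
def flow_table_delta(previous, current):
    """Return only the entries that changed between two flow table snapshots."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = [k for k in previous if k not in current]
    changed = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical snapshots taken in two consecutive collection cycles.
previous_snapshot = {"10.0.0.0/24": "port1", "10.0.1.0/24": "port2"}
current_snapshot = {"10.0.0.0/24": "port3", "10.0.2.0/24": "port4"}
delta = flow_table_delta(previous_snapshot, current_snapshot)
```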
2. Network measurer
The network measurer periodically measures the running state of the programmable network element device and reports it to the diagnostic decision maker.
In this embodiment, Fig. 3 is a schematic diagram of the network measurer. The functions of the network measurer include:
The network measurer connects to the programmable network element device through a physical connection or a network connection, and establishes a TCP connection with the communication module in the diagnostic decision maker, which is used to feed the measured device state information back to the diagnostic decision maker.
After the connections are established, the key indicators and performance parameters to be measured, such as network delay, bandwidth utilization and packet loss rate, are selected according to the characteristics and requirements of the programmable network element device.
The network measurer periodically measures the key indicators and performance parameters of the programmable network element device, calling the data transmission API provided by the device to obtain its running state data.
The network measurer then feeds the measured running state data back to the diagnostic decision maker for further analysis and processing, for example by calling the API provided by the communication module.
The network measurer needs to combine active and passive measurement. Active measurement means that the network measurer actively sends probe packets to the programmable network element device and analyzes the returned results to obtain performance parameters. Passive measurement means analyzing the state from the existing statistics and log information of the programmable network element device. In addition, the characteristics of different network links and devices need to be considered so that a targeted measurement scheme can be designed.
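As a small illustration of active measurement, the sketch below turns the results of a batch of probe packets into two of the indicators named above. The input format (round-trip times in milliseconds, with `None` for a lost probe) and the name `summarize_probes` are assumptions; no real probing is performed.

```python
def summarize_probes(rtts_ms):
    """Derive packet loss rate and average delay from probe results.

    rtts_ms: round-trip times in milliseconds, None for a probe with no reply.
    """
    received = [r for r in rtts_ms if r is not None]
    loss_rate = 1 - len(received) / len(rtts_ms)
    avg_delay = sum(received) / len(received) if received else None
    return {"packet_loss_rate": loss_rate, "avg_delay_ms": avg_delay}

# Four hypothetical probes, one of which was lost.
report = summarize_probes([10.0, None, 14.0, 12.0])
```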
To provide comprehensive and accurate measurement data, the network measurer needs to be deployed on key nodes of the network and needs to cooperate with the measurement module inside the programmable network element device to obtain more first-hand internal state information. In addition, mechanisms such as time synchronization and measurement configuration coordination are needed to ensure that the measurement results of different nodes can be aligned and correlated.
Before the collected measurement data is fed back to the diagnostic decision maker, the network measurer needs to perform the necessary online analysis and processing to extract useful state feature information. A data buffering and retransmission mechanism is also needed to improve the reliability of the measurement data.
3. Diagnostic decision maker
The diagnostic decision maker takes the data uploaded by the data collector as the input of a deep neural network, trains a fault diagnosis model, diagnoses and makes decisions on the running state data uploaded by the network measurer based on the fault diagnosis model, and issues processing rule information to the fault processor.
In this embodiment, Fig. 4 is a schematic diagram of the diagnostic decision maker. The diagnostic decision maker comprises a communication module, a diagnostic algorithm module and a strategy generation module.
1. Communication module
On the one hand, after establishing TCP connections with the distributed data collectors, the communication module receives the dependency relationships between the working tasks of the programmable network element device, the cooperative operation logic and the identifiers of the communication stages. On the other hand, the communication module establishes a TCP connection with the fault processor, issues the fault processing rules generated by the strategy generation module, and obtains the measurement data of the programmable network element device after the fault processor has processed it.
After receiving the device state information sent by the network measurer, the communication module adds it to the message receiving queue of the diagnostic algorithm module, and adds the processing rule information sent by the strategy generation module to the message sending queue.
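The two queues can be sketched with Python's standard `queue` module. The class and method names below are hypothetical and the TCP handling is omitted; this is only meant to show the direction of each message flow.

```python
import queue

class CommunicationModule:
    """Minimal sketch of the receive and send queues (names are assumed)."""

    def __init__(self):
        # Device states wait here until the diagnostic algorithm module reads them.
        self.receive_queue = queue.Queue()
        # Processing rules wait here until they are sent to the fault processor.
        self.send_queue = queue.Queue()

    def on_device_state(self, state):
        """Called when the network measurer reports a device state."""
        self.receive_queue.put(state)

    def on_processing_rule(self, rule):
        """Called when the strategy generation module emits a rule."""
        self.send_queue.put(rule)

    def next_rule_to_send(self):
        """Return the oldest pending rule, or None if nothing is queued."""
        try:
            return self.send_queue.get_nowait()
        except queue.Empty:
            return None

comm = CommunicationModule()
comm.on_device_state({"device": "ne-1", "packet_loss_rate": 0.25})
comm.on_processing_rule({"device": "ne-1", "action": "hot_restart"})
pending_state = comm.receive_queue.get_nowait()
pending_rule = comm.next_rule_to_send()
```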
The communication module is responsible for providing a reliable and secure communication mechanism and for ensuring data exchange inside the diagnosis system and with the outside. It needs to handle the heterogeneity between different components by providing a unified communication interface and data format conversion. It also needs to implement mechanisms such as connection management, flow control, buffering and retransmission to guarantee communication quality.
To support connections from large numbers of network devices, the communication module can adopt a distributed design, deploying multiple service instances with load balancing. Techniques such as caching, multicast and filtering can be used to improve communication efficiency.
For network security, the communication module needs to provide comprehensive protection mechanisms, including identity authentication, data encryption and access control. Important sensitive information must be transmitted with a cryptographic signature to prevent tampering.
The communication module also needs to offer customizable communication protocol plug-ins to adapt to heterogeneous networks and devices, and to implement monitoring, statistics and logging so that network communication faults can be investigated easily.
For operations such as issuing key control instructions, the communication module needs to ensure the atomicity, consistency and reliability of transmission. Timeout retransmission and idempotency checks are also required to handle fault conditions such as network jitter and packet loss and to ensure that instructions are executed reliably.
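The combination of timeout retransmission and idempotency checking can be sketched as follows. All names (`FaultProcessorStub`, `send_with_retry`, the `hot_restart` command string) are hypothetical, and the transport is simulated in memory rather than over TCP.

```python
import uuid

class FaultProcessorStub:
    """Receiver side: executes each instruction ID at most once."""

    def __init__(self):
        self.executed = {}   # instruction ID -> cached result
        self.run_count = 0

    def handle(self, instruction_id, command):
        if instruction_id not in self.executed:   # idempotency check
            self.run_count += 1
            self.executed[instruction_id] = f"done:{command}"
        return self.executed[instruction_id]

def send_with_retry(transport, command, max_attempts=3):
    """Resend the same instruction ID until it is acknowledged or attempts run out."""
    instruction_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return transport(instruction_id, command)
        except TimeoutError:
            continue   # retransmit with the SAME instruction ID
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")

receiver = FaultProcessorStub()
attempts = []

def flaky_transport(instruction_id, command):
    """Simulated link: the first acknowledgement is lost after execution."""
    attempts.append(instruction_id)
    result = receiver.handle(instruction_id, command)
    if len(attempts) == 1:
        raise TimeoutError("acknowledgement lost")
    return result

reply = send_with_retry(flaky_transport, "hot_restart")
```

Because the retransmission reuses the instruction ID, the receiver returns the cached result instead of restarting the device a second time.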
Clear interface conventions are also required between the communication module and the other components so that they are well isolated from one another. A simulation test interface should be provided for simulating various communication scenarios and faults during system integration testing.
2. Diagnostic algorithm module
The diagnostic algorithm module reads the dependency relationships between the working tasks of the programmable network element device and the cooperative task identifiers, computes the start priority of each working task using a topological sorting algorithm, and outputs the priorities to the strategy generation module. When the dependency relationships of the cooperative tasks are topologically sorted, cooperative tasks that start at the same time are marked with the same priority.
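One way to obtain shared priorities is a level-by-level variant of Kahn's topological sort, in which every task whose prerequisites are all satisfied in the same round receives the same priority. This is a sketch under that assumption; the task names and the `(task, prerequisite)` pair format are hypothetical.

```python
from collections import defaultdict

def start_priorities(tasks, depends_on):
    """Return {task: priority}; priority 0 starts first.

    depends_on: list of (task, prerequisite) pairs.
    """
    indegree = {t: 0 for t in tasks}
    successors = defaultdict(list)
    for task, prereq in depends_on:
        successors[prereq].append(task)
        indegree[task] += 1

    level = [t for t in tasks if indegree[t] == 0]   # tasks with no prerequisites
    priority, current = {}, 0
    while level:
        next_level = []
        for t in level:
            priority[t] = current          # same round, same priority
            for s in successors[t]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    next_level.append(s)
        level, current = next_level, current + 1

    if len(priority) != len(tasks):
        raise ValueError("dependency cycle detected")
    return priority

example = start_priorities(
    ["parser", "acl", "routing", "telemetry"],
    [("acl", "parser"), ("routing", "parser"), ("telemetry", "routing")],
)
# example == {"parser": 0, "acl": 1, "routing": 1, "telemetry": 2}
```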
The diagnostic algorithm module takes as input the large volume of data collected by the data collectors together with feedback data on the running condition of the programmable network element devices, builds an AI model of the working task dependency relationships and cooperative operation logic through deep neural network training and inference, iterates its model parameters under a deep reinforcement learning framework through continuous interaction with real network environment information, and adjusts its parameters according to the reward derived from its output, thereby improving the accuracy and effectiveness of the model.
The processed input data are the dependency relationships and cooperative operation logic of the working tasks. For a sequence, the attention mechanism essentially finds the interrelationships between the different tokens in the input, learning the token-to-token relationships through weight matrices. Therefore, the deep neural network architecture is preferably a Transformer model based entirely on the self-attention mechanism.
The attention formula used in machine learning is as follows:
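A reconstruction consistent with the four workflow steps listed next, assuming the standard scaled dot-product form of the Transformer (the scaling factor $\sqrt{d_k}$, with $d_k$ the dimension of the Key vectors, is an assumption and is not stated in the surrounding text):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
```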
the workflow of the attention mechanism is:
1) Respectively converting the input sequence X into a Query matrix, a Key matrix and a Value matrix to obtain Q, K, V;
2) Calculating a correlation score by using the dot product of Q and K to obtain an attention matrix;
3) Performing softmax on the attention matrix to obtain attention weight alpha;
4) And carrying out weighted summation on alpha and V to obtain output.
Thus, the computation over Q, K and V models the associations between features at different positions in the input. Q represents the content to be matched, K represents the objects it is matched against, and V represents the values of those objects. The attention weight α captures the degree of association between Q and K. The attention value is obtained by computing the similarity of Q and K and multiplying the result by V; in self-attention, the attention coefficient is computed for every Q against every K in turn; the self-attention is preferably dot-product attention.
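The four steps can be traced in a small pure-Python sketch. The projection matrices `w_q`, `w_k`, `w_v`, the toy input, and the division by the square root of the Key dimension are illustrative assumptions; a real model would learn the projections and use a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # 1) project the input sequence X into Q, K and V
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # 2) dot product of Q and K gives the attention matrix (scaling is assumed)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # 3) softmax over each row gives the attention weights alpha
    alpha = [softmax(row) for row in scores]
    # 4) weighted sum of V gives the output
    return alpha, matmul(alpha, v)

identity = [[1.0, 0.0], [0.0, 1.0]]               # toy projections
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # three tokens, two features
alpha, output = self_attention(x, identity, identity, identity)
```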
The Transformer model consists of two parts, an Encoder stack containing 6 Encoders and a Decoder stack containing 6 Decoders, as follows:
Transformer = 6 * Encoder + 6 * Decoder
where the Encoder is used for feature extraction and the Decoder is used for the generation task;
Encoder = Embedding + Positional Embedding +N * EncoderBlock
Because a computer cannot directly process a word or a Chinese character, each token needs to be converted into a vector that the computer can work with, which is the Embedding process. Since the relationship between preceding and following words must be taken into account, the embedding preferably uses WordEmbedding, while Positional Embedding adds the position information of each token to the embedded input.
Preferably, the input representation abstracts each work task into a vector, including task ID, type, priority, resource requirements, etc.; the tasks are arranged according to time sequence and are integrated with sequence information; inputting a vector representation of a task for each time step; the data for each location in the input sequence may be focused on information for other locations, thereby extracting features by attention or capturing relationships between each token of the input sequence.
The input sequence is denoted $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is a $d$-dimensional vector representing the word vector with sequence number $i$:
X = Embedding + Positional Embedding
From the input-output perspective, the input of the first of the N EncoderBlocks is the set of vectors X = Embedding + Positional Embedding, whose dimension is typically 512 × 512; the input of each of the other N-1 EncoderBlocks is the output of the preceding EncoderBlock, and the output dimension is also 512 × 512. The first 512 is the sequence length, chosen to cover different sequence lengths, with shorter sequences padded; the second 512 is the dimension of the vector generated for each token, i.e. each token is represented by a vector of length 512. After the N Transformer Encoder blocks have run, their output is passed to the Transformer Decoder, where it serves as K and V in the Decoder's Q, K, V.
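As a rough illustration of how X = Embedding + Positional Embedding might be assembled with padding to a fixed length, consider the sketch below. The text does not say which positional embedding is used; the sinusoidal form from the original Transformer is assumed here, and all function names are hypothetical.

```python
import math

def positional_embedding(max_len, d_model):
    """Sinusoidal position vectors (an assumption; the text does not fix the form)."""
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Even and odd dimensions of a pair share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def build_input(token_vectors, max_len, d_model):
    """Pad the embedded tokens to max_len rows, then add position information."""
    if len(token_vectors) > max_len:
        raise ValueError("sequence longer than max_len")
    padding = [[0.0] * d_model for _ in range(max_len - len(token_vectors))]
    padded = token_vectors + padding
    pe = positional_embedding(max_len, d_model)
    return [[e + p for e, p in zip(ev, pv)] for ev, pv in zip(padded, pe)]
```

With max_len = 512 and d_model = 512 this gives the 512 × 512 input described above. Padded rows also receive position values in this sketch; a real model would normally mask them out in attention.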
EncoderBlock = SubEncoderBlock1 + SubEncoderBlock2
SubEncoderBlock1 = Multi-Head Attention + Add + Norm
SubEncoderBlock2 = Feed Forward + Add + Norm
Multi-head attention splits the input features into several groups (heads) and computes attention within each group separately; Add is a residual connection, which prevents network degradation and is commonly used to address the difficulty of training multi-layer networks; Norm normalizes the input so that its mean is zero and its variance is 1; Feed Forward projects the information into a specific space and then applies a nonlinear mapping;
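The Add, Norm, and Feed Forward steps can be sketched for a single token vector as shown below. This is a minimal sketch under stated assumptions: the names are hypothetical, ReLU is assumed as the nonlinearity because the text only says "nonlinear mapping", and the learned scale and shift that layer normalization usually carries are omitted.

```python
import math

def layer_norm(x, eps=1e-6):
    """Norm: rescale a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Add: residual connection around a sub-layer, followed by Norm."""
    residual = [a + b for a, b in zip(x, sublayer_out)]
    return layer_norm(residual)

def feed_forward(x, W1, b1, W2, b2):
    """Feed Forward: project, apply ReLU (assumed), then project again."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]
```

SubEncoderBlock2 for one token would then be add_and_norm(x, feed_forward(x, W1, b1, W2, b2)), provided the second projection returns a vector of the same length as x.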
Preferably, the Decoder outputs the optimal task arrangement sequence based on the Encoder representation, and uses Encoder-Decoder attention to learn the dependency and constraint relationships between tasks;
Decoder = Embedding + Positional Embedding + N * DecoderBlock + Linear + Softmax
DecoderBlock = SubDecoderBlock1 + SubDecoderBlock2 + SubDecoderBlock3
SubDecoderBlock1 = Masked Multi-Head Attention + Add + Norm
SubDecoderBlock2 = Multi-Head Attention + Add + Norm
SubDecoderBlock3 = Feed Forward + Add + Norm
Masked multi-head attention prevents the i-th token from seeing information from token i+1 onward.
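One common way to realize this masking is to set the scores of future positions to negative infinity before the softmax, so that their attention weights become zero. The sketch below illustrates the idea with hypothetical helper names; it is not taken from the patent.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax in which disallowed positions receive exactly zero weight."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because a position may always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

For a three-token sequence, the second token splits its attention between the first two tokens and gives the third token no weight.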
Preferably, to train the model, a number of examples with different task combinations are constructed, the optimal collaborative order is provided as the supervisory signal, a suitable loss function is designed, and the model outputs the tasks arranged in the optimal order.
In terms of hardware, the base model is preferably trained on 8 P100 GPUs for 12 hours, and the large model for 3.5 days. Model parameters and tunable hyperparameters:
Trainable parameters include WQ, WK, WV, WO, and the parameters of the Feed Forward middle layer;
Tunable hyperparameters include: the dimension of each token's vector representation, the number of attention heads, the number N of repeated blocks in the Encoder and Decoder, the dimension of the FFN middle-layer vector, label smoothing (confidence 0.1), and dropout (0.1).
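The tunable hyperparameters listed above could be collected in a small configuration object, as in the sketch below. The values 512, 6, and 0.1 come from the text; the head count of 8 and the FFN width of 2048 are assumptions borrowed from the original Transformer base configuration, and the class and function names are hypothetical. The label smoothing helper uses one common variant, which spreads the smoothing mass evenly over the non-target classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerHyperParams:
    d_model: int = 512            # dimension of each token vector (from the text)
    num_blocks: int = 6           # N, blocks per Encoder/Decoder (from the text)
    num_heads: int = 8            # assumption: not stated in the text
    d_ff: int = 2048              # assumption: not stated in the text
    label_smoothing: float = 0.1  # from the text
    dropout: float = 0.1          # from the text

def smooth_labels(target_index, num_classes, eps):
    """Turn a one-hot target into a smoothed distribution that sums to 1."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == target_index else off_value
            for i in range(num_classes)]
```

With eps = 0.1 and five classes, the target class receives 0.9 and each other class 0.025.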
Preferably, at inference time the model takes the newly input work tasks and quickly predicts the optimal collaborative order. Preferably, reinforcement learning may also be used: a simulated environment is constructed that evaluates the performance of different collaborative orders and returns it as the reward signal, guiding the model to keep learning.
3. Policy generation module
The policy generation module formulates processing rule information according to the function limits of the work tasks and the priorities of the dependency-related cooperative tasks generated by the diagnostic algorithm module, where the function limits include the limit on the number of work-task priority channels supported by the programmable network element equipment.
The policy generation module is responsible for generating the final fault handling policy and rules from the analysis results of the diagnostic algorithm, combined with the function limits of the network equipment. The module needs to implement the following functions:
(1) Receive and parse the fault cause and impact analysis results output by the diagnostic algorithm, and extract information such as the key factors causing the fault and the affected task chains.
(2) Plan the fault handling flow according to the functions supported by the network equipment and constraints such as the number of priority channels, and determine handling policies such as task priorities and the reordered task sequence.
(3) Convert the handling policy into specific rules and instructions, such as executable rules for flow table modification and task redeployment. Each rule must include the action to perform, the target object, and the related parameters.
(4) Send the generated rules to the communication module, which delivers them to the network equipment, or to a management system where an administrator executes them. Batch rule generation is also supported.
(5) After the processing rules are issued, obtain feedback on their status and confirm that the rules were loaded and took effect correctly. If there is an error, the rules must be reissued or revised.
(6) Verify the effect of rule execution against the network state after the new rules are running, and iterate on the rules if necessary.
(7) Store a log of policy rule generation that records information such as the basis for the handling decision and the parameter settings, to facilitate later analysis or audit.
(8) Provide a custom extension interface for policy generation, to support customized policy generation for different types of equipment.
The handling scheme output by the policy generation module must be comprehensively verified to ensure its correctness, effectiveness, and safety. The interfaces among the policy generation module, the diagnostic algorithm, and the network equipment must be clearly defined to keep the components loosely coupled.
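As an illustration of how task priorities might be fitted into a limited number of priority channels and turned into rules that carry an action, an object, and parameters, consider the sketch below. It is not the patent's algorithm: the function names, the rule fields, and the choice to let all overflow priority levels share the last channel are assumptions made for the example.

```python
def assign_channels(task_priority, num_channels):
    """Map task -> priority level (0 starts first) onto at most num_channels channels."""
    if num_channels < 1:
        raise ValueError("the device must support at least one priority channel")
    levels = sorted(set(task_priority.values()))
    # Levels beyond the last channel share that channel (assumed overflow policy).
    level_to_channel = {lvl: min(rank, num_channels - 1)
                        for rank, lvl in enumerate(levels)}
    return {task: level_to_channel[lvl] for task, lvl in task_priority.items()}

def build_rules(task_priority, num_channels):
    """Emit one rule per task with the action, the target object, and parameters."""
    channels = assign_channels(task_priority, num_channels)
    ordered = sorted(channels.items(), key=lambda kv: (kv[1], kv[0]))
    return [{"action": "set_priority_channel",   # hypothetical action name
             "object": task,
             "params": {"channel": channel}}
            for task, channel in ordered]
```

Tasks that the diagnostic algorithm module marked with the same priority end up in the same channel, and the resulting list can be handed to the communication module as a batch.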
4. Fault processor
The fault processor remotely controls the programmable network element equipment diagnosed as being in a fault state: it hot-restarts the system, initializes the equipment using the flow table backed up in the data collector, and reconnects the equipment to the network. Specifically, after receiving the policy message sent by the communication module, the fault processor parses it, extracts the location information of the programmable network element, and performs remote control.
The fault processor is the component of the diagnosis system that carries out the concrete fault handling actions. It needs to implement the following key functions:
(1) Receive and parse the fault handling policy sent by the communication module, and extract information such as the target network equipment and the processing rules.
(2) Send control instructions to the specified network equipment according to the policy, performing operations such as flow table updates and system restarts, and process the acknowledgement feedback for each command.
(3) Obtain the backed-up flow table and configuration information from the data collector in order to rebuild the expected state of the network equipment.
(4) Connect to the network management platform and call the relevant interfaces to execute fault handling flows such as equipment isolation and traffic migration.
(5) Interact with the control layer of the network equipment and send low-level instructions to perform low-level control such as system restarts and component resets.
(6) After fault handling has been performed, verify the equipment state to confirm whether the fault has been corrected. If it has not been completely cleared, trigger a new round of diagnosis.
(7) Record a log of the entire fault handling process, including control instructions, parameter settings, and feedback results, for tracking and auditing.
(8) In complex situations, support rollback of partial fault handling, to avoid operations that would cause a larger fault.
(9) To limit the scope of a fault's impact, coordinate among groups of equipment so that fault handling is performed at the proper time and in the proper order.
(10) Use the network measurement and analysis component to monitor the network's running state after fault handling, and provide feedback on the handling effect.
As the execution layer, the fault processor is critical and must be stable and reliable. It should have sufficient redundancy and use a failover mechanism to avoid a single point of failure.
In this embodiment, Fig. 5 is a schematic diagram of the fault processor.
The fault processor receives the fault information sent by the diagnosis decision device and confirms the fault state of the programmable network element equipment. It then remotely controls the equipment diagnosed as faulty: it performs a hot restart of the system, initializes the equipment by loading the flow table backed up in the data collector, and reconnects the equipment to the network to restore normal operation.
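The recovery sequence described above (hot restart, flow table restore from the backup, network re-access, then verification) could be sketched as follows. The device class is a hypothetical in-memory stand-in, not a real device API, and the JSON backup format is an assumption; the text only says that the flow table is backed up in file form.

```python
import json
from dataclasses import dataclass, field

@dataclass
class FakeDevice:
    """Hypothetical in-memory stand-in for a programmable network element."""
    flow_table: list = field(default_factory=list)
    online: bool = True
    log: list = field(default_factory=list)

    def hot_restart(self):
        self.flow_table = []   # a restart clears the running flow table
        self.online = False
        self.log.append("hot_restart")

    def load_flow_table(self, entries):
        self.flow_table = list(entries)
        self.log.append("load_flow_table")

    def rejoin_network(self):
        self.online = True
        self.log.append("rejoin_network")

def recover(device, backup_json):
    """Return True if the device is online again with the backed-up flow table."""
    # Parse the backup first so a corrupt file never triggers a restart.
    entries = json.loads(backup_json)
    device.hot_restart()
    device.load_flow_table(entries)
    device.rejoin_network()
    # Verify the state; a False result should trigger a new round of diagnosis.
    return device.online and device.flow_table == entries
```

The device log records each control step in order, which corresponds to the whole-process logging listed among the fault processor's functions.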
Another embodiment of the present application provides a fault diagnosis method for programmable network element equipment, implemented on the basis of the fault diagnosis system described above. The method specifically includes:
storing, through the data collector, a globally consistent flow table on the programmable network element equipment in file form as a backup, collecting the dependency relationships and cooperative operation logic between the work tasks of the programmable network element equipment, and uploading them to the diagnosis decision device;
periodically measuring the running state of the programmable network element equipment through the network measurer and reporting it to the diagnosis decision device;
taking, by the diagnosis decision device, the data uploaded by the data collector as the input of a deep neural network, training a fault diagnosis model, performing diagnosis and decision-making on the running state uploaded by the network measurer based on the fault diagnosis model, and issuing processing rule information to the fault processor;
and remotely controlling, through the fault processor, the programmable network element equipment diagnosed as being in a fault state, hot-restarting the system, initializing the programmable network element equipment using the flow table backed up in the data collector, and reconnecting it to the network.
Those skilled in the art will appreciate that the foregoing describes only preferred embodiments of the application and is not intended to limit the application to the specific embodiments described. Those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. Modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (10)

1. A fault diagnosis system for programmable network element equipment, characterized by comprising a data collector, a network measurer, a diagnosis decision device, and a fault processor;
the data collector is used for storing a globally consistent flow table on the programmable network element equipment in file form as a backup, collecting the dependency relationships and cooperative operation logic between the work tasks of the programmable network element equipment, and uploading them to the diagnosis decision device;
the network measurer is used for periodically measuring the running state of the programmable network element equipment and reporting it to the diagnosis decision device;
the diagnosis decision device takes the data uploaded by the data collector as the input of a deep neural network, trains a fault diagnosis model, performs diagnosis and decision-making on the running state data uploaded by the network measurer based on the fault diagnosis model, and issues processing rule information to the fault processor;
and the fault processor remotely controls the programmable network element equipment diagnosed as being in a fault state, hot-restarts the system, initializes the programmable network element equipment using the flow table backed up in the data collector, and reconnects it to the network.
2. The system according to claim 1, wherein the data collector builds and labels a data set from the relevant working state data of the programmable network element equipment in conventional usage scenarios, labeling the task information and the diagnostic elements in those scenarios.
3. The fault diagnosis system of programmable network element equipment according to claim 2, wherein the conventional scenarios comprise a normal operation scenario and constructed fault scenarios, and the ways of labeling diagnostic elements comprise semantic recognition, entity recognition, and data cleaning.
4. The system of claim 1, wherein the diagnosis decision device comprises a communication module, a diagnostic algorithm module, and a policy generation module;
the communication module communicates on the one hand with the distributed data collectors to obtain the dependency relationships, cooperative operation logic, and communication-stage identifiers among the work tasks of the programmable network element equipment, and on the other hand with the fault processor to issue the fault processing rules generated by the policy generation module and to obtain the measurement data of the programmable network element equipment after processing by the fault processor;
the diagnostic algorithm module reads the dependency relationships among the cooperative tasks of the programmable network element equipment and the cooperative task identifiers, computes the start priority of the cooperative tasks using a topological sorting algorithm, and outputs it to the policy generation module; when the dependency relationships of the cooperative tasks are topologically sorted, cooperative tasks that start at the same time are marked with the same priority;
the policy generation module formulates specific processing rules according to the function limits of the work tasks and the work task priorities with dependency relationships generated by the diagnostic algorithm module, wherein the function limits include the limit on the number of work-task priority channels supported by the programmable network element equipment.
5. The system of claim 1, wherein the network measurer provides real-time equipment state information feedback to the diagnosis decision device by monitoring key indicators and performance parameters.
6. The fault diagnosis system of programmable network element equipment according to claim 4, wherein the diagnostic algorithm module takes the large volume of data collected by the data collector and the feedback data on the running state of the programmable network element equipment as input, builds an AI model by using deep neural network training and inference to learn the dependency relationships and cooperative operation logic of the work tasks, iterates its model parameters under a deep reinforcement learning framework through continuous interaction with real network environment information, and adjusts its parameters according to the output reward, thereby improving the accuracy and effectiveness of the model.
7. The system according to claim 4, wherein the policy generation module generates processing rule information according to the function limits of the work tasks and the priorities of the dependency-related cooperative tasks output by the diagnostic algorithm module, and sends the processing rule information to the fault processor through the communication module.
8. The system according to claim 4, wherein, after receiving the equipment state information sent by the network measurer, the communication module adds the processing rule information sent by the policy generation module to the transmission message queue.
9. The system according to claim 1, wherein, after receiving the policy message sent by the communication module, the fault processor parses it, extracts the location information of the programmable network element, and performs remote control.
10. A fault diagnosis method for programmable network element equipment, characterized in that the method is implemented based on the fault diagnosis system for programmable network element equipment according to any one of claims 1 to 9, the method specifically comprising:
storing, through the data collector, a globally consistent flow table on the programmable network element equipment in file form as a backup, collecting the dependency relationships and cooperative operation logic between the work tasks of the programmable network element equipment, and uploading them to the diagnosis decision device;
periodically measuring the running state of the programmable network element equipment through the network measurer and reporting it to the diagnosis decision device;
taking, by the diagnosis decision device, the data uploaded by the data collector as the input of a deep neural network, training a fault diagnosis model, performing diagnosis and decision-making on the running state uploaded by the network measurer based on the fault diagnosis model, and issuing processing rule information to the fault processor;
and remotely controlling, through the fault processor, the programmable network element equipment diagnosed as being in a fault state, hot-restarting the system, initializing the programmable network element equipment using the flow table backed up in the data collector, and reconnecting it to the network.
CN202311238378.0A 2023-09-25 2023-09-25 Fault diagnosis system and fault diagnosis method for programmable network element equipment Active CN116980279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311238378.0A CN116980279B (en) 2023-09-25 2023-09-25 Fault diagnosis system and fault diagnosis method for programmable network element equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311238378.0A CN116980279B (en) 2023-09-25 2023-09-25 Fault diagnosis system and fault diagnosis method for programmable network element equipment

Publications (2)

Publication Number Publication Date
CN116980279A true CN116980279A (en) 2023-10-31
CN116980279B CN116980279B (en) 2023-12-12

Family

ID=88473513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311238378.0A Active CN116980279B (en) 2023-09-25 2023-09-25 Fault diagnosis system and fault diagnosis method for programmable network element equipment

Country Status (1)

Country Link
CN (1) CN116980279B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110216680A (en) * 2019-07-05 2019-09-10 山东大学 A kind of service robot cloud ground collaborative fault diagnosis system and method
US20190384790A1 (en) * 2016-02-05 2019-12-19 Sas Institute Inc. Staged training of neural networks for improved time series prediction performance
CN111666982A (en) * 2020-05-19 2020-09-15 上海核工程研究设计院有限公司 Electromechanical equipment fault diagnosis method based on deep neural network
CN112086214A (en) * 2020-09-23 2020-12-15 中国核动力研究设计院 Nuclear power station key equipment remote state monitoring and intelligent diagnosis platform
CN114553671A (en) * 2022-02-28 2022-05-27 国家电网有限公司 Diagnosis method for power communication network fault alarm
US20220357733A1 (en) * 2021-05-07 2022-11-10 Servicenow, Inc. Detection and Correction of Robotic Process Automation Failures
US20230025081A1 (en) * 2021-07-23 2023-01-26 EMC IP Holding Company LLC Model training method, failure determining method, electronic device, and program product
WO2023035869A1 (en) * 2022-03-15 2023-03-16 中国长江三峡集团有限公司 Gearbox fault diagnosis model training method and gearbox fault diagnosis method
WO2023138337A1 (en) * 2022-01-18 2023-07-27 华为技术有限公司 Motor fault detection method and apparatus
CN116629627A (en) * 2023-04-25 2023-08-22 内蒙古电力(集团)有限责任公司内蒙古电力科学研究院分公司 Intelligent detection system of power transmission on-line monitoring device
CN116684358A (en) * 2023-07-31 2023-09-01 之江实验室 Flow table management system and method for programmable network element equipment
CN116723085A (en) * 2023-07-14 2023-09-08 苏州浪潮智能科技有限公司 Service conflict processing method and device, storage medium and electronic device
CN116756642A (en) * 2023-06-13 2023-09-15 江苏长田信息科技有限公司 Industrial material conveying line monitoring method based on multi-mode decision fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant