Disclosure of Invention
In view of the above, an object of the present invention is to provide a service function chain fault detection method based on prediction, which can effectively detect a node abnormal condition according to a change of performance data of a VNF node in a network virtualization environment, and meet a reliability requirement of a network.
In order to achieve the purpose, the invention provides the following technical scheme:
A prediction-based service function chain fault detection method, characterized by specifically comprising the following steps:
S1: in view of the fault propagation characteristics of the service function chain scenario and the performance correlation among VNF nodes, collecting data by monitoring the performance data of each VNF on the whole service function chain, and classifying the working state represented by the data as normal, service quality reduction, or fault;
S2: in view of the high dimensionality, complexity and time correlation of network monitoring data, and the requirement that fault detection be proactive, adopting a Gated Recurrent Unit (GRU) network to detect faults, and predicting the health condition of the network by analyzing the historical performance data of the service function chain; and, because modeling a GRU network takes a long time, which conflicts with the real-time requirement of fault detection, introducing transfer learning to accelerate the convergence of the model by exploiting the similarity of VNF nodes among different service function chains.
Further, in step S1, in the service function chain scenario, a service function chain is generally formed by arranging a plurality of mutually independent VNFs in a certain order according to the specific requirements of a user, and the VNFs of different service function chains may partially overlap. Because of this characteristic, a fault in one VNF is very likely to propagate to the associated VNFs, causing a large-area fault in the slice network.
Considering that the fault has the characteristic of spreading among different VNF nodes in the service function chain, the performance data is collected in a mode of monitoring the working state of each VNF node of the whole service function chain at the application layer of the slice network, the occurrence of the fault is detected through the continuous analysis of the performance data by the detection system, and the first VNF node showing the fault is taken as the starting point of the occurrence of the fault.
Further, in order to eliminate the influence on model training of different units of measurement, different value ranges and indistinct trends in the raw data samples, and to improve model accuracy and training speed, the performance data collected from all VNF nodes is normalized with the linear max-min method. The conversion function is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is a raw value of a performance metric, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that metric in the sample set, and $x'$ is the normalized value.
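A minimal sketch of linear max-min normalization, assuming each performance metric (column) is scaled independently into [0, 1]. The function names and the handling of a constant column are illustrative assumptions, not part of the described method.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant metric: there is no spread to scale.
        # Mapping everything to 0.0 is an assumed convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(samples):
    """Normalize each metric (column) of a list of samples independently."""
    columns = list(zip(*samples))
    scaled = [min_max_normalize(list(col)) for col in columns]
    return [list(row) for row in zip(*scaled)]
```

For example, `normalize_columns([[1, 10], [3, 30]])` scales the two metrics separately, so metrics with very different value ranges end up on a common scale.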
Further, in step S1, the operating status of the service function chain is divided into three categories according to the cause of the fault:
(1) Normal: the network runs well;
(2) Service quality reduction: network load increases, traffic decreases, delay increases or packets are lost, but the VNF can still work; possible causes include sudden changes in the surrounding environment, software or hardware problems, or insufficient resources, and the system may either recover to a normal working state within a short time or continue to deteriorate into a fault;
(3) Fault: the network function is completely unavailable, the delay becomes infinite, and the corresponding service can no longer be provided to users; to keep the SFC containing the VNF node operating normally, operations such as node migration or a restart of software and hardware devices are required.
Further, the rule for executing node migration is as follows: a faulty VNF node may arise from a sudden change of a normal VNF node, or from a VNF node in the service quality reduction state whose performance has degraded beyond a certain extent. A VNF node in the service quality reduction state returns to normal operation if it can adapt to the optimization adjustments of the system; otherwise it deteriorates into a fault, and the network function must be restored through necessary healing measures.
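The state transitions described above can be sketched as a small state machine. This is an illustrative sketch only: the names `VnfState`, `ALLOWED` and `can_transition` are hypothetical, the edge from normal to service quality reduction is implied rather than stated, and the absence of a direct edge from fault to service quality reduction is an assumption.

```python
from enum import Enum


class VnfState(Enum):
    NORMAL = "normal"
    DEGRADED = "service quality reduction"
    FAULT = "fault"


# Transitions taken from the description. FAULT -> NORMAL stands for
# healing measures such as node migration or a restart.
ALLOWED = {
    VnfState.NORMAL: {VnfState.NORMAL, VnfState.DEGRADED, VnfState.FAULT},
    VnfState.DEGRADED: {VnfState.DEGRADED, VnfState.NORMAL, VnfState.FAULT},
    VnfState.FAULT: {VnfState.FAULT, VnfState.NORMAL},
}


def can_transition(src, dst):
    """Return True if the described model permits moving from src to dst."""
    return dst in ALLOWED[src]
```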
Further, in step S2, in view of the high dimensionality, complexity and time correlation of network monitoring data, and the requirement that fault detection be proactive, detecting faults with the GRU network specifically comprises the following steps:
S201: processing the input time-series sample data with three layers of GRU units, and training the model in mini-batches on a historical monitoring data set;
S202: after the three GRU layers, integrating the feature information of the preceding layers through a fully connected layer to improve the learning capability of the network;
S203: using the output of the fully connected layer as the input of a softmax classifier, and performing supervised fine-tuning by backpropagation with the label data;
S204: further optimizing the parameters with real-time monitoring data.
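The classification head of steps S202 and S203 can be sketched as a fully connected layer followed by a softmax over the three working states. This is a minimal sketch in plain Python, assuming the GRU layers have already produced a feature vector; `dense`, `softmax`, `classify` and the weight shapes are hypothetical names chosen for illustration.

```python
import math

STATES = ["normal", "service quality reduction", "fault"]


def dense(features, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Map a feature vector to (most likely state, state probabilities)."""
    probs = softmax(dense(features, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return STATES[best], probs
```

In training, the probabilities would be compared with the labeled state through a loss function, and the error would be propagated back through the fully connected layer and the GRU layers.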
Further, in step S2, predicting the health condition of the network by analyzing the historical performance data information of the service function chain specifically includes:
The historical performance data is assumed to consist of waiting delay and processing delay. In the training stage, features are first collected from the waiting delay and processing delay of all VNF nodes on a service function chain. Suppose a service function chain is composed of m VNF nodes and the monitoring data of all VNF nodes is recorded at each moment. With the sliding window length defined as d, the input data set of the network model over the time range from t-d to t is expressed as $X = \{x_t, x_{t-1}, \ldots, x_{t-d+1}\}$. At time t, the data set $x_t$ of all VNF nodes collects, for each of the m VNF nodes, its waiting delay and its processing delay, the last pair of entries being the waiting delay and the processing delay of the m-th VNF, respectively;
The prediction method is as follows: since the input of the GRU network model is time-series data, time-series samples must be constructed with a sliding window. A window of length d is moved over the data set with time step h to obtain the sample at the current moment, $X_t = \{x_t, x_{t-1}, \ldots, x_{t-d+1}\}$, and the sample at the next moment, $X_{t+h} = \{x_{t+h}, x_{t+h-1}, \ldots, x_{t+h-d+1}\}$; the label values are determined as $x_{t+1}$ and $x_{t+h+1}$, respectively, according to the actual network state at the next moment;
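The sliding-window construction can be sketched as follows. This is a minimal sketch, assuming `series[t]` holds the feature vector at time t and `labels[t]` holds the observed network state at time t; the function name and the list-based data layout are illustrative assumptions.

```python
def sliding_windows(series, labels, d, h):
    """Build (window, label) pairs with window length d and time step h.

    Each window holds d consecutive feature vectors ending at time t,
    stored newest first, and is paired with the state observed at t + 1.
    """
    samples = []
    t = d - 1
    while t + 1 < len(series):
        window = [series[t - k] for k in range(d)]  # x_t, x_{t-1}, ..., x_{t-d+1}
        samples.append((window, labels[t + 1]))
        t += h
    return samples
```

With six time steps, d = 3 and h = 2, this yields two samples: one window ending at t = 2 labeled with the state at t = 3, and one ending at t = 4 labeled with the state at t = 5.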
Then, according to the input dimension of the GRU, the samples are divided into a training set and a test set in a certain proportion, and the model is trained in mini-batches. After the three GRU layers, a fully connected layer integrates the feature information of the preceding layers, improving the learning capability of the network, and the output of the fully connected layer is finally used as the input of a softmax classifier to obtain the final prediction result. To prevent overfitting during training, Dropout regularization is applied to discard part of the redundant information generated in the training process;
Training of the network model: training is a process of continuously optimizing the model parameters. To reduce the error between the output and the actual network state, the back propagation algorithm iteratively updates the network parameters, optimizing the network weights layer by layer by gradient descent so as to minimize the target loss function. Compared with other parameter optimization algorithms, the Adam algorithm has advantages in computational efficiency and convergence speed, so the Adam algorithm with an adaptive learning rate is adopted to accelerate convergence.
Further, the Adam algorithm dynamically adjusts the learning rate of each parameter by using moment estimates of the gradient, is suitable for large data sets and high-dimensional spaces, and updates the parameters as follows:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where $\theta$ is the iteration parameter, $\eta$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-order and second-order moment estimates of the gradient, respectively, and $\epsilon$ is a smoothing term.
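A single Adam update can be sketched in plain Python as follows. This is a minimal sketch of the standard Adam rule, not the specific training code of the embodiment. The decay rates `beta1` and `beta2` and the default values shown are the commonly used ones and are assumptions here, since the text does not state them.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat list of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-order moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-order moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

On the first step the bias-corrected estimates reduce to the gradient and its square, so each parameter moves by approximately the learning rate in the direction opposite to its gradient, regardless of the gradient magnitude.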
Further, in step S2, introducing transfer learning to accelerate model convergence specifically includes: in a network slicing scenario, the size of the monitoring data set is limited by how long the slice has been running. For a service function chain in the early stage of slice operation, fault detection accuracy may be low because monitoring data is insufficient, so a parameter-based transfer learning method is introduced to accelerate model convergence, allowing the fault detection model based on GRU neural network prediction to keep high detection accuracy with a small data sample.
The source domain model parameters are selected from a model trained on data of a service function chain whose performance index requirements are similar to those of the target domain. The fault detection model parameters of another service function chain with a structure similar to the current one are then migrated to the current service function chain to help its fault detection model train better. The specific steps are as follows: the historical data sample set of service function chain SFC b is used to train the GRU fault detection model of SFC b, yielding the optimal parameter matrices at convergence. Taking $W_i^b$ as an example, let the migration ratio $\phi(t) \in (0, 1]$ denote the degree to which a parameter is migrated from the SFC b model to the SFC a model, where $\phi(t) = 1/t$ decreases as time t increases. The parameter matrix of SFC a is then $W_i^a = \phi(t) W_i^b + (1 - \phi(t)) W_i^a$, with $W_i^a = W_i^b$ at the initial time. The model is then trained and fine-tuned with the data sample set of SFC a to obtain the optimal GRU network model parameters $W_i^{a'}$ of SFC a.
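The parameter migration rule can be sketched as an element-wise blend of two parameter matrices. This is a minimal sketch, assuming the matrices are stored as nested lists of equal shape and that t counts from 1; the function names are hypothetical.

```python
def migration_ratio(t):
    """phi(t) = 1 / t for t = 1, 2, ...; equals 1 at the initial time."""
    return 1.0 / t


def blend_parameters(w_a, w_b, t):
    """Return phi(t) * W_b + (1 - phi(t)) * W_a, element-wise."""
    phi = migration_ratio(t)
    return [[phi * b + (1.0 - phi) * a for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(w_a, w_b)]
```

At t = 1 the result equals the SFC b parameters, matching the stated initial condition; as t grows, the contribution of the SFC b model shrinks and the parameters learned on SFC a data dominate.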
The invention has the beneficial effects that: aiming at the problem of fault detection in a 5G end-to-end network slicing scene, the invention can effectively extract massive and high-dimensional data characteristics in a complex network on the basis of meeting the requirement of the system on detection accuracy, simultaneously ensures the timeliness of fault detection and has high application value in a wireless communication system.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to Figs. 1 to 4, Fig. 1 is a schematic view of a scenario in which the present invention can be implemented. Referring to Fig. 1, the application layer is mainly responsible for providing an ordered set of VNFs for each slice request to process traffic, and the infrastructure layer provides physical nodes and links with many types of resources, such as computing, bandwidth and storage resources, that support the functional requirements of the various slice networks. The virtualization layer implements functions such as slice life cycle management and network performance data monitoring through the NFV MANO and the SDN controller. The system generates a specific service function chain for each service request, thereby meeting the service requirements of users. The traffic, bandwidth, latency and other requirements of the VNFs in each service function chain differ, but the resources required by two adjacent VNFs and by the virtual link connecting them are correlated to some degree. To ensure the stability and service quality of the network, the node states of the VNFs along the whole service function chain need to be monitored so that faults are detected in time.
Fig. 2 is a node state transition diagram of the present embodiment. A faulty VNF node may arise from a sudden change of a normal VNF node, or from a VNF node in the service quality reduction state whose performance has degraded beyond a certain extent. A VNF node in the service quality reduction state may be restored to a normal operating state if it can adapt to the optimization adjustments of the system; otherwise it deteriorates into a fault, and the network function must be restored through necessary healing measures.
Fig. 3 shows the fault detection model based on a GRU network in an embodiment of the present invention. Referring to Fig. 3, the fault detection model based on GRU network prediction consists of three layers of GRU units, a fully connected layer and a softmax classifier. Each input sample at time t is defined as the health status feature information $x_t$ of all VNF nodes on the same service function chain. Training the GRU network model yields the hidden layer state $h_t$ of the model at that moment, from which the predicted state $y_{t+1}$ of each VNF is derived. Since the state of a VNF at the next moment is affected by the data observed at the current moment and by the hidden layer state at the previous moment, the GRU network can use historical observations to predict the health state of VNF nodes in the future.
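How a GRU unit combines the current observation with the previous hidden state can be sketched as a single recurrent step in plain Python. This is an illustrative sketch of the standard GRU equations rather than the embodiment's implementation: the parameter dictionary keys are hypothetical, and the update convention $h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t$ is one of two equivalent conventions in common use.

```python
import math


def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def _affine(w, x, u, h, b):
    """Compute w @ x + u @ h + b for nested lists."""
    return [sum(wi * xi for wi, xi in zip(w_row, x))
            + sum(ui * hi for ui, hi in zip(u_row, h)) + b_i
            for w_row, u_row, b_i in zip(w, u, b)]


def gru_step(x, h_prev, p):
    """One GRU step; p maps 'wz','uz','bz','wr','ur','br','wh','uh','bh' to weights."""
    z = [_sigmoid(a) for a in _affine(p["wz"], x, p["uz"], h_prev, p["bz"])]  # update gate
    r = [_sigmoid(a) for a in _affine(p["wr"], x, p["ur"], h_prev, p["br"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a) for a in _affine(p["wh"], x, p["uh"], gated, p["bh"])]
    # Assumed convention: h_t = (1 - z) * h_{t-1} + z * candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(window_oldest_first, h0, p):
    """Feed a window of feature vectors through the cell, oldest first."""
    h = h0
    for x in window_oldest_first:
        h = gru_step(x, h, p)
    return h
```

The final hidden state returned by `run_gru` summarizes the whole window; in the described model, three such layers are stacked and their output is passed to the fully connected layer and the softmax classifier.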
Fig. 4 is a flowchart illustrating a service function chain fault detection method according to an embodiment of the present invention. The method comprises the following specific steps:
Step 401: initializing system parameters, including the GRU network model parameters of SFC b and SFC a, the learning rate, and the iteration counter k = 0;
Step 402: inputting the historical performance data of all VNF nodes of SFC b and the real-time performance data of all VNF nodes of SFC a before time t;
Step 403: performing normalization preprocessing on the input data;
Step 404: constructing time-series input data according to the sliding window length and the time step;
Step 405: training a GRU network model with the historical performance data of all VNF nodes of SFC b, fine-tuning the model by backpropagation with the Adam algorithm with an adaptive learning rate, and updating the corresponding parameters;
Step 406: judging whether the convergence condition is met; if not, setting k = k + 1 and returning to step 405; otherwise, extracting the latest model parameters and executing step 407;
Step 407: transferring the model parameters extracted from the GRU network model of SFC b to the GRU network model of SFC a, and further training the model with the real-time performance data of SFC a before time t;
Step 408: judging whether the maximum number of iterations K is reached; if not, setting k = k + 1 and returning to step 407; otherwise, executing step 409;
Step 409: outputting the working state of each VNF node in SFC a at time t + 1.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.