CN111526070A

CN111526070A - Service function chain fault detection method based on prediction

Info

Publication number: CN111526070A
Application number: CN202010359298.0A
Authority: CN
Inventors: 唐伦; 廖皓; 贺兰钦; 胡彦娟; 陈前斌
Original assignee: Chongqing University of Posts and Telecommunications
Current assignee: Hangzhou Mingshi Technology Co ltd; Shenzhen Wanzhida Technology Transfer Center Co ltd
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2020-08-11
Anticipated expiration: 2040-04-29
Also published as: CN111526070B

Abstract

The invention relates to a prediction-based service function chain fault detection method and belongs to the technical field of communication. First, because the performance of the virtual network functions (VNFs) on a service function chain is correlated in a virtualized network environment, the method collects data by monitoring the performance data of every VNF on the whole service function chain and classifies the working state that the data represents. Second, considering the high dimensionality, complexity and time dependence of network monitoring data, and the need for fault detection to be proactive, the method uses a gated recurrent unit (GRU) network to detect faults, predicting the health of the network by analyzing the historical performance data of the service function chain. Transfer learning is introduced to accelerate model convergence by exploiting the similarity of VNF nodes across different service function chains. The invention can effectively detect faults while meeting the time requirement of fault detection.

Description

Service function chain fault detection method based on prediction

Technical Field

The invention belongs to the technical field of communication, and relates to a service function chain fault detection method based on prediction.

Background

The arrival of 5G will bring an epoch-making change: the 5G network will provide differentiated service scenarios and customized network services for people's diverse service demands. The existing network architecture is mainly oriented to person-to-person communication and struggles to support emerging services that require large-scale machine communication, such as the Internet of Things and the Internet of Vehicles. 5G therefore treats network slicing as the ideal network architecture and the key to solving these problems. A 5G network slicing architecture based on software-defined networking (SDN) and network function virtualization (NFV) can be orchestrated flexibly according to user requirements and can allocate resources dynamically and efficiently. NFV decouples the software and hardware of the traditional network and implements network functions in software, which makes network resource sharing easier.

However, while the network slicing architecture brings great flexibility to the 5G network, it also places new demands on network reliability. Because the underlying resources are shared, the failure of one underlying network node can easily cause multiple Virtual Network Functions (VNFs) sharing that node's resources to fail, paralyzing multiple Service Function Chains (SFCs). It is therefore very important to use self-organizing network (SON) technology to recover the network quickly from faults. Fault detection is the core of network performance analysis and the primary prerequisite for any self-healing measure.

The inventor finds that the following disadvantages exist in the process of researching the prior art:

Most existing fault detection methods are based on a reactive mechanism, which cannot avoid the inherent delay of reacting and therefore works against the timeliness the network requires. Most methods do not consider degraded quality of service and neglect the adverse effect of a continuous decline in performance on the network. Nor do they consider that, in the service function chain scenario, faults propagate vertically between the infrastructure layer and the application layer and horizontally between different service function chains; this propagation causes performance anomalies in some otherwise normal VNF nodes and makes it difficult to locate the VNF node where the fault first occurred.

Disclosure of Invention

In view of the above, an object of the present invention is to provide a prediction-based service function chain fault detection method that can effectively detect node anomalies from changes in the performance data of VNF nodes in a network virtualization environment, and thereby meet the reliability requirements of the network.

In order to achieve the purpose, the invention provides the following technical scheme:

a service function chain fault detection method based on prediction is characterized by specifically comprising the following steps:

S1: Taking into account the fault propagation characteristics of the service function chain scenario and the performance correlation among VNF nodes, collect data by monitoring the performance data of each VNF on the whole service function chain, and classify the working state as normal, degraded quality of service, or fault;

S2: Considering the high dimensionality, complexity and time dependence of network monitoring data, and the need for fault detection to be proactive, use a Gated Recurrent Unit (GRU) network to detect faults, predicting the health of the network by analyzing the historical performance data of the service function chain; because building a GRU network model takes a long time, which works against the real-time requirement of fault detection, introduce transfer learning to accelerate model convergence by exploiting the similarity of VNF nodes across different service function chains.

Further, in step S1, in the service function chain scenario, a service function chain is generally formed by arranging several functionally independent VNFs in a certain order according to the specific requirements of the user. The VNFs of different service function chains may partially overlap, which makes it very easy for a VNF fault to propagate to associated VNFs and cause a large-area failure of the slice network.

Because faults propagate among the VNF nodes of a service function chain, performance data is collected at the application layer of the slice network by monitoring the working state of every VNF node on the whole service function chain. The detection system detects faults by analyzing the performance data continuously, and the first VNF node to exhibit the fault is taken as the point where the fault originated.

Further, the raw data samples have different dimensions and value ranges and show no obvious trend, all of which affect model training. To eliminate these effects and to improve model accuracy and training speed, the performance data collected from all VNF nodes is normalized with the linear max-min method. The conversion function is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is a raw value of a performance indicator, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that indicator in the sample set, and $x'$ is the normalized value.
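Linear min-max normalization rescales each performance indicator to the range [0, 1] using its observed minimum and maximum. A minimal Python sketch follows; the function names, the one-indicator-per-column layout and the handling of a constant column are assumptions made for illustration and are not taken from the patent text.

```python
def min_max_normalize(column):
    """Scale the raw readings of one indicator to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: map every value to 0.0 (an assumption of this sketch).
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_samples(samples):
    """Normalize each indicator (column) of a list of equally long rows."""
    columns = list(zip(*samples))
    scaled = [min_max_normalize(list(col)) for col in columns]
    return [list(row) for row in zip(*scaled)]


# Hypothetical readings: [waiting delay, processing delay] of one VNF at three moments.
raw = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
normalized = normalize_samples(raw)
```

Normalizing per indicator keeps delays measured on different scales comparable before they are fed to the model.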

Further, in step S1, the working state of the service function chain is divided into three categories according to the cause of the fault:

(1) Normal: the network is running well;

(2) Degraded quality of service: network load increases, traffic decreases, delay increases or packets are lost, but the VNF can still work; possible causes include sudden changes in the surrounding environment, software or hardware problems, or insufficient resources, and the VNF may either recover to a normal working state within a short time or continue to deteriorate into a fault;

(3) Fault: the network function cannot be used at all, the delay becomes infinite, and the corresponding service can no longer be provided to the user; to keep the SFC containing the VNF node running normally, operations such as node migration or restarting software and hardware devices must be performed.

Further, the rule for executing node migration is as follows: a faulty VNF node may arise from a sudden change in a normal VNF node, or from a VNF node with degrading quality of service once its performance has dropped to a certain extent. A VNF node with degraded quality of service returns to a normal working state if it can adapt to the system's optimization adjustments; otherwise it deteriorates into a fault, and the network function must be restored through the necessary healing measures.

Further, in step S2, given the high dimensionality, complexity and time dependence of network monitoring data and the need for fault detection to be proactive, detecting faults with the GRU network specifically comprises the following steps:

S201: process the input time-series sample data with three layers of GRU units, training the model in mini-batches on the historical monitoring data set;

S202: after the three GRU layers, integrate the feature information of the preceding layers through a fully connected layer to improve the learning capability of the network;

S203: use the output of the fully connected layer as the input of a softmax classifier, and perform supervised fine-tuning by backpropagation with the label data;

S204: further optimize the parameters using real-time monitoring data.
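The final classification stage can be sketched in a few lines of Python: a softmax turns the three outputs of the fully connected layer into probabilities over the three working states, and the most probable state is reported. The state names, the `classify` helper and the example scores are hypothetical; the GRU layers and the fully connected layer that would produce the scores are not modeled here.

```python
import math

# Order of the three working states; the ordering is an assumption of this sketch.
STATES = ("normal", "degraded", "fault")


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable state and the full probability vector."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return STATES[best], probs


# Hypothetical fully connected layer output for one VNF node.
state, probs = classify([0.2, 2.5, 0.1])
```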

Further, in step S2, predicting the health condition of the network by analyzing the historical performance data information of the service function chain specifically includes:

The historical performance data is assumed to be the waiting delay and the processing delay. In the training stage, features are first collected from the waiting delay and the processing delay of all VNF nodes on the service function chain. Suppose a service function chain consists of m VNF nodes and the monitoring data of all VNF nodes is recorded at every moment. With the sliding window length defined as d, the input data set of the network model over the time range from t-d to t is $X = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$. At time t, the data set $x_t$ of all VNF nodes contains the waiting delay and the processing delay of each of the m VNFs;

The prediction method is as follows: since the input of the GRU network model is time-series data, time-series samples must be constructed with a sliding window. A window of length d is moved over the data set with time step h, yielding the sample at the current moment, $X_t = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, and the sample at the next moment, $X_{t+h} = \{x_{t+h}, x_{t+h-1}, \dots, x_{t+h-d+1}\}$. The label values, determined from the actual network state at the following moment, are $x_{t+1}$ and $x_{t+h+1}$ respectively;
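The sliding-window construction can be illustrated with a short Python sketch. It assumes the observations and the actual states are stored as two time-ordered lists of equal length; `build_windows` and the sample values are hypothetical names and data chosen for the example.

```python
def build_windows(observations, states, d, h):
    """Build (window, label) pairs with window length d and time step h.

    Each window holds x_t, x_{t-1}, ..., x_{t-d+1}; its label is the
    actual state at time t + 1.
    """
    samples = []
    t = d - 1  # index of the newest observation in the first full window
    while t + 1 < len(observations):
        window = [observations[t - k] for k in range(d)]
        samples.append((window, states[t + 1]))
        t += h
    return samples


# Hypothetical scalar observations and the state actually seen at each moment.
observations = [10, 11, 12, 13, 14, 15, 16]
states = ["normal", "normal", "normal", "degraded", "degraded", "fault", "fault"]
pairs = build_windows(observations, states, d=3, h=2)
```

With d = 3 and h = 2, the first window covers indices 2, 1, 0 and is labeled with the state at index 3; the second covers indices 4, 3, 2 and is labeled with the state at index 5.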

Then, according to the GRU input dimension, the samples are split into a training set and a test set in a certain proportion, and the model is trained in mini-batches. After the three-layer GRU network, a fully connected layer integrates the feature information of the preceding layers to improve the learning capability of the network, and the output of the fully connected layer is fed to a softmax classifier to obtain the final prediction. To prevent overfitting during training, Dropout regularization is used to discard part of the redundant information generated in the training process;

Training of the network model: training is a process of continuously optimizing the model parameters. To reduce the error between the output and the real network state, the backpropagation algorithm iteratively updates the network parameters, optimizing the network weights layer by layer by gradient descent so as to minimize the value of the target loss function. Compared with other parameter optimization algorithms, the Adam algorithm has advantages in computational efficiency and convergence speed, so Adam, with its adaptive learning rate, is adopted to accelerate convergence.

Further, the Adam algorithm dynamically adjusts the learning rate of each parameter using moment estimates of the gradient and is suitable for large data sets and high-dimensional spaces. Its update formula is as follows:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

where $\theta$ is the iteration parameter, $\eta$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-order and second-order moment estimates of the gradient, respectively, and $\epsilon$ is a smoothing term.
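A single Adam update for one scalar parameter can be written out as the following Python sketch. The decay rates `beta1` and `beta2`, the default step size and the toy quadratic objective are the commonly used defaults and an illustrative choice; none of them is specified in the patent text.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction of the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction of the second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from theta = 0 with gradient -6: the move is close to lr in size.
first_theta, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, 1, lr=0.05)

# Toy objective f(theta) = (theta - 3)^2, used only to exercise the update.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
```

Because the step is divided by the square root of the second-moment estimate, the size of the first move depends on the learning rate rather than on the raw gradient magnitude.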

Further, in step S2, introducing transfer learning to accelerate model convergence specifically includes the following. In a network slicing scenario, the size of the monitoring data set is limited by how long the slice has been running. For a service function chain in the early stage of slice operation, insufficient monitoring data may lead to low fault detection accuracy. A parameter-based transfer learning method is therefore introduced to accelerate model convergence, so that the fault detection model based on GRU neural network prediction keeps a high detection accuracy even with a small data sample.

The source domain model parameters are selected from models trained on service function chain data whose performance index requirements are similar to those of the target domain. The fault detection model parameters of another service function chain with a structure similar to the current one are then migrated to the current service function chain, helping its fault detection model train better. The specific steps are as follows. The historical data sample set of service function chain SFC b is used to train the GRU fault detection model of SFC b, yielding the optimal parameter matrices at convergence. Taking $W_i^b$ as an example, let the migration ratio $\varphi(t) \in (0, 1)$ denote the degree to which a parameter is migrated from the SFC b model to the SFC a model, where $\varphi(t) = 1/t$ decreases as time t increases. The parameter matrix of SFC a is then $W_i^a = \varphi(t) W_i^b + (1 - \varphi(t)) W_i^a$, with $W_i^a = W_i^b$ at the initial time. The model is trained and fine-tuned with the data sample set of SFC a to obtain the optimal GRU network model parameters $W_i^{a'}$ of SFC a.
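The parameter migration rule can be sketched in Python as a weighted blend of the two parameter matrices, with the migration ratio taken as 1/t. Matrices are plain nested lists here, and `migration_ratio`, `blend` and the example values are hypothetical names and data used only for illustration.

```python
def migration_ratio(t):
    """Migration ratio 1/t for a 1-based time index t."""
    return 1.0 / t


def blend(w_a, w_b, t):
    """Return phi(t) * W_b + (1 - phi(t)) * W_a, element by element."""
    phi = migration_ratio(t)
    return [
        [phi * b + (1 - phi) * a for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(w_a, w_b)
    ]


# Hypothetical converged parameters of SFC b and untrained parameters of SFC a.
w_b = [[1.0, 2.0], [3.0, 4.0]]
w_a = [[0.0, 0.0], [0.0, 0.0]]

w_a_t1 = blend(w_a, w_b, t=1)  # at the initial time the SFC a parameters equal W_b
w_a_t4 = blend(w_a, w_b, t=4)  # later, only a quarter of W_b is mixed in
```

As t grows the contribution of the source model shrinks, so the parameters of SFC a are increasingly determined by its own training data.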

The beneficial effects of the invention are as follows: for fault detection in a 5G end-to-end network slicing scenario, the invention can effectively extract the features of massive, high-dimensional data in a complex network while meeting the system's detection accuracy requirement, and at the same time ensures the timeliness of fault detection, giving it high application value in wireless communication systems.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic diagram of a scenario in which the present invention may be applied;

FIG. 2 is a node state transition diagram according to an embodiment of the present invention;

FIG. 3 is a fault detection model based on a GRU network in an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a service function chain fault detection method according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

Referring to FIGS. 1 to 4, FIG. 1 is a schematic diagram of a scenario in which the present invention can be implemented. As shown in FIG. 1, the application layer is mainly responsible for providing each slice request with an ordered set of VNFs to process its traffic. The infrastructure layer provides the physical nodes and links, with many types of resources such as computing, bandwidth and storage resources, that support the functional requirements of the various sliced networks. The virtualization layer implements functions such as slice life cycle management and network performance data monitoring through the NFV MANO and the SDN controller. The system generates a specific service function chain for each service request, thereby meeting the service requirements of users. The VNFs in each service function chain differ in their traffic, bandwidth, latency and other requirements, but there is a certain correlation between the resources required by two adjacent VNFs and the virtual link connecting them. To ensure the stability and quality of service of the network, the node states of the VNFs on the whole service function chain must be monitored and faults detected in time.

FIG. 2 is the node state transition diagram of this embodiment. A faulty VNF node may arise from a sudden change in a normal VNF node, or from a VNF node with degrading quality of service once its performance has dropped to a certain extent. A VNF node with degraded quality of service may return to a normal working state if it can adapt to the system's optimization adjustments; otherwise it deteriorates into a fault, and the network function must be restored through the necessary healing measures.
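The transitions described for the three working states can be captured as a small lookup table, shown here as a Python sketch. The event names ("degrade", "adapt", "fail", "heal") and the `step` helper are hypothetical labels for the causes named in the text; the figure itself is not reproduced in this text.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded quality of service"
    FAULT = "fault"


# (current state, event) -> next state; event names are assumptions of this sketch.
TRANSITIONS = {
    (State.NORMAL, "degrade"): State.DEGRADED,  # load rises, delay grows, packets drop
    (State.NORMAL, "fail"): State.FAULT,        # sudden change of a normal node
    (State.DEGRADED, "adapt"): State.NORMAL,    # node adapts to the system's adjustment
    (State.DEGRADED, "fail"): State.FAULT,      # performance keeps deteriorating
    (State.FAULT, "heal"): State.NORMAL,        # healing measure, e.g. migration or restart
}


def step(state, event):
    """Return the next state, or the unchanged state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


recovered = step(step(State.NORMAL, "degrade"), "adapt")
```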

FIG. 3 shows the fault detection model based on a GRU network in an embodiment of the present invention. Referring to FIG. 3, the fault detection model based on GRU network prediction consists of three layers of GRU units, a fully connected layer and a softmax classifier. Each input sample at time t is defined as the health state feature information $x_t$ of all VNF nodes on the same service function chain. Training the GRU network model yields the hidden layer state $h_t$ of the model at that time, from which the predicted state $y_{t+1}$ of each VNF is derived. Since the state of a VNF at the next time is affected by the observed data at that time and the hidden layer state at the previous time, the GRU network can use historical observations to predict the future health state of VNF nodes.
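The patent text does not spell out the internal equations of the GRU unit. For reference, one common formulation of a single GRU layer is given below; the weight names $W$, $U$, $b$ and the output weights $W_o$, $b_o$ are notation introduced here, and published conventions differ on which branch the update gate $z_t$ weights.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)} \\
y_{t+1} &= \operatorname{softmax}\left(W_o h_t + b_o\right) && \text{(predicted state)}
\end{aligned}
```

In a three-layer stack, the hidden state of each layer serves as the input $x_t$ of the layer above, and the last line is applied to the hidden state of the top layer through the fully connected layer.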

FIG. 4 is a flowchart of the service function chain fault detection method according to an embodiment of the present invention. The specific steps are as follows:

Step 401: initialize the system parameters, including the GRU network model parameters of SFC b and SFC a, the learning rate, and the iteration count k = 0;

Step 402: input the historical performance data of all VNF nodes of SFC b and the real-time performance data of all VNF nodes of SFC a before time t;

Step 403: normalize the input data;

Step 404: construct the time-series input data according to the sliding window length and the time step;

Step 405: train the GRU network model with the historical performance data of all VNF nodes of SFC b, fine-tune the model by backpropagation using the Adam algorithm with adaptive learning rate, and update the corresponding parameters;

Step 406: check whether the convergence condition is met. If not, set k = k + 1 and return to step 405; otherwise, extract the latest model parameters and go to step 407;

Step 407: transfer the model parameters extracted from the GRU network model of SFC b to the GRU network model of SFC a, and train the model further with the real-time performance data of SFC a before time t;

Step 408: check whether the maximum number of iterations K has been reached. If not, set k = k + 1 and return to step 407; otherwise go to step 409;

Step 409: output the working state of each VNF node in SFC a at time t + 1.
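The control flow of steps 405 to 409 can be outlined as the following Python sketch. It models only the two loops and the shared iteration counter; the four callables and the toy stand-ins at the bottom are hypothetical placeholders for the real training, convergence test, fine-tuning and prediction routines.

```python
def run_detection(train_step_b, has_converged, finetune_step_a, predict_a, max_iterations):
    """Two-stage loop following steps 405 to 409; all callables are hypothetical."""
    k = 0
    # Steps 405-406: train on SFC b data until the convergence condition holds.
    params = train_step_b(None)
    while not has_converged(params):
        k += 1
        params = train_step_b(params)
    # Steps 407-408: transfer to SFC a and fine-tune until k reaches K.
    params = finetune_step_a(params)
    while k < max_iterations:
        k += 1
        params = finetune_step_a(params)
    # Step 409: output the predicted working state.
    return predict_a(params), k


# Toy stand-ins: the "parameters" are a single loss-like number.
result = run_detection(
    train_step_b=lambda p: 1.0 if p is None else p / 2,
    has_converged=lambda p: p < 0.1,
    finetune_step_a=lambda p: p * 0.9,
    predict_a=lambda p: "normal" if p < 0.05 else "degraded",
    max_iterations=6,
)
```

A practical implementation would also bound the first loop, since this sketch keeps training on SFC b until the convergence test passes.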

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A service function chain fault detection method based on prediction is characterized by specifically comprising the following steps:

S1: taking into account the fault propagation characteristics of the service function chain scenario and the performance correlation among Virtual Network Function (VNF) nodes, collecting data by monitoring the performance data of each VNF on the whole service function chain, and classifying the working state of the VNF as normal, degraded quality of service, or fault;

2. The method according to claim 1, wherein in step S1, in the service function chain scenario, the service function chain is formed by arranging a plurality of functionally independent VNFs in a certain order according to the specific requirements of the user; performance data is collected at the application layer of the slice network by monitoring the working state of each VNF node on the whole service function chain, the detection system detects faults by continuously analyzing the performance data, and the first VNF node to exhibit the fault is taken as the point where the fault originated.

3. The prediction-based service function chain fault detection method of claim 2, wherein the performance data collected from all VNF nodes is normalized and preprocessed by a linear maximum-minimum method.

4. The method according to claim 1, wherein in step S1 the working state of the service function chain is divided into three categories according to the cause of the fault:

(1) normal: the network is running well;

(2) degraded quality of service: network load increases, traffic decreases, delay increases or packets are lost, but the VNF can still work;

(3) fault: the network function cannot be used at all, the delay becomes infinite, and the corresponding service can no longer be provided to the user; node migration or a restart of software and hardware devices is carried out.

5. The prediction-based service function chain fault detection method according to claim 4, wherein the rule for performing node migration is: a VNF node with degraded quality of service returns to a normal working state if it can adapt to the system's optimization adjustments; otherwise it deteriorates into a fault, and the network function must be restored through a healing measure.

6. The method according to claim 1, wherein in the step S2, for the high-dimensional complexity and time-related characteristics of the network monitoring data, the method for detecting the fault by using the GRU network in combination with the initiative requirement of fault detection specifically includes the following steps:

S204: further optimizing the parameters using real-time monitoring data.

7. The method according to claim 6, wherein the step S2 of predicting the health condition of the network by analyzing the historical performance data information of the service function chain specifically comprises:

the historical performance data is assumed to be the waiting delay and the processing delay; in the training stage, features are first collected from the waiting delay and the processing delay of all VNF nodes on the service function chain; a service function chain is assumed to consist of m VNF nodes, and the monitoring data of all VNF nodes at each moment is recorded; with the sliding window length defined as d, the input data set of the network model over the time range from t-d to t is $X = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$; at time t, the data set $x_t$ of all VNF nodes contains the waiting delay and the processing delay of each of the m VNFs;

the prediction method comprises: constructing time-series samples with a sliding window of length d that moves over the data set with time step h, yielding the sample at the current moment, $X_t = \{x_t, x_{t-1}, \dots, x_{t-d+1}\}$, and the sample at the next moment, $X_{t+h} = \{x_{t+h}, x_{t+h-1}, \dots, x_{t+h-d+1}\}$, and determining the label values from the actual network state at the following moment as $x_{t+1}$ and $x_{t+h+1}$ respectively;

training of the network model: iteratively updating the network parameters with the backpropagation algorithm; optimizing the network weights layer by layer by gradient descent so as to minimize the value of the target loss function; and adopting the Adam algorithm with adaptive learning rate to accelerate convergence.

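8. The prediction-based service function chain fault detection method according to claim 7, wherein the Adam algorithm dynamically adjusts the learning rate of each parameter using moment estimates of the gradient, and the update formula is as follows:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

wherein $\theta$ is the iteration parameter, $\eta$ is the learning rate, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first-order and second-order moment estimates of the gradient, respectively, and $\epsilon$ is a smoothing term.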

9. The method according to claim 1, wherein in step S2, transfer learning is introduced to accelerate model convergence, specifically including: the source domain model parameters are selected from models trained on service function chain data whose performance index requirements are similar to those of the target domain; the fault detection model parameters of another service function chain with a structure similar to the current one are then migrated to the current service function chain, helping its fault detection model train better; the specific steps are: using the historical data sample set of Service Function Chain (SFC) b to train the GRU fault detection model of SFC b, and obtaining the optimal parameter matrix $W_i^b$ at convergence; letting the migration ratio $\varphi(t) \in (0, 1)$ denote the degree to which a parameter is migrated from the SFC b model to the SFC a model, where $\varphi(t) = 1/t$; the parameter matrix of SFC a is $W_i^a = \varphi(t) W_i^b + (1 - \varphi(t)) W_i^a$, with $W_i^a = W_i^b$ at the initial time; and training and fine-tuning the model with the data sample set of SFC a to obtain the optimal GRU network model parameters $W_i^{a'}$ of SFC a.