CN109951327B

CN109951327B - Network fault data synthesis method based on Bayesian hybrid model

Info

Publication number: CN109951327B
Application number: CN201910165006.7A
Authority: CN
Inventors: 阴法明; 杜庆波
Original assignee: Nanjing College of Information Technology
Current assignee: Nanjing College of Information Technology
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2021-08-20
Anticipated expiration: 2039-03-05
Also published as: CN109951327A

Abstract

The invention discloses a network fault data synthesis method based on a Bayesian hybrid model, which addresses the degradation in prediction performance that conventional network fault prediction suffers when little fault data is available. With this method, the characteristics of an unbalanced network data set can be captured accurately, and the accuracy of network fault prediction is effectively improved.

Description

Network fault data synthesis method based on Bayesian hybrid model

Technical Field

The invention relates to a Bayesian hybrid model-based network fault data synthesis method, and belongs to the technical field of unbalanced data processing.

Background

With the development of internet technology, more and more users are using various types of network services, and network operators are striving to provide higher-quality and more stable streaming video transmission services. Network faults easily degrade the quality of user experience. Conversely, if an operator can accurately predict a network fault in advance and take measures to resolve the problems that may occur in the network, the user experience can be effectively improved. Predicting and promptly handling users' faults is therefore crucial for network operators.

In an actual system, network fault data make up only a small proportion of the whole network data set collected by the system; in other words, the probability of a network fault is far lower than the probability that the network is normal. The network data set is therefore unbalanced. An unbalanced data set is one in which one class of data is significantly smaller than the other. Here, the amount of network fault data (minority class samples) is much smaller than the amount of network normal data (majority class samples). When trained on such unbalanced data, a conventional classifier usually becomes biased: it predicts the majority class with high accuracy but the minority class with low accuracy. Among the methods for processing unbalanced data sets, sampling-based methods are typical: they turn an unbalanced data set into a balanced one by changing the distribution of the data set.

Most existing methods deal with unbalanced data by generating new minority samples directly from existing samples, for example the Synthetic Minority Oversampling Technique (SMOTE). These methods are intuitive, but they do not mine the distribution characteristics of the minority class samples in depth. The generated samples therefore do not necessarily help classification and often harm it, and the new minority class samples are not representative, so these methods cannot be applied well to network fault prediction.
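To make concrete what generating new minority samples directly from existing samples means, the following is a minimal SMOTE-style sketch in Python. It is an illustration under stated assumptions and not the reference SMOTE implementation: the function name `smote_like`, the neighbour count `k` and the toy data are hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours, so only local geometry is used and no model of the class distribution is fitted.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustrative)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy two-dimensional minority class (hypothetical values)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing points, the generated data cannot leave the region already covered by the minority samples, which is one way to read the limitation described above.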

Disclosure of Invention

The invention aims to overcome the defects in the existing network fault data processing, and provides a network fault data synthesis method based on a Bayesian hybrid model.

The technical scheme of the invention is as follows: a network fault data synthesis method based on a Bayesian hybrid model comprises the following steps:

Step 1: denote the collected network data set (N samples in total) as

X = {x_n}, n = 1, ..., N

where each x_n comprises six attributes: packet loss rate, terminal download rate, transmission delay, jitter, video transmission quality and end-user experience score; the data set corresponds to a set of labels

Y = {y_n}, n = 1, ..., N

with y_n = 0 or 1, i.e. X corresponds to two classes of labels, where y_n = 0 is the network-normal class label and y_n = 1 is the network-fault class label. Because the network-normal class contains far more data than the network-fault class, the set formed by the x_n with y_n = 1 is defined as the minority class

X^alm = {x_n^alm}, n = 1, ..., N^alm

where x_n^alm is a minority class sample and N^alm is the number of minority class samples, and the set formed by the x_n with y_n = 0 is the majority class

X^maj = {x_n^maj}, n = 1, ..., N^maj

where x_n^maj is a majority class sample and N^maj is the number of majority class samples;
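Step 1 amounts to partitioning the collected records by label. A minimal sketch in Python, assuming each record is a list of the six attributes in the order listed in step 1 and that the labels are stored in a parallel list; the function name `split_by_label` and the numeric values are hypothetical and purely illustrative:

```python
def split_by_label(samples, labels):
    """Split records into the minority (label 1, fault) and majority (label 0, normal) sets."""
    minority = [x for x, y in zip(samples, labels) if y == 1]
    majority = [x for x, y in zip(samples, labels) if y == 0]
    return minority, majority


# Each record: [packet loss rate, download rate, delay, jitter,
#               video transmission quality, user experience score]
samples = [
    [0.001, 9.8, 12.0, 1.5, 4.7, 4.8],
    [0.002, 9.1, 15.0, 2.0, 4.5, 4.6],
    [0.150, 1.2, 240.0, 35.0, 1.8, 1.5],  # a fault record
    [0.001, 9.5, 11.0, 1.2, 4.8, 4.9],
]
labels = [0, 0, 1, 0]

X_alm, X_maj = split_by_label(samples, labels)
```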

Step 2: a Bayesian mixture model is selected to represent the distribution of X^alm; its probability density function is:

p(x_n) = Σ_{j=1}^{K} π_j(V) t(x_n | μ_j, Λ_j, v_j)

where K is the number of mixture components, and π_j(V), μ_j, Λ_j and v_j denote the weight, the mean, the covariance matrix and the degree-of-freedom parameter of the j-th mixture component, respectively; t(x_n | μ_j, Λ_j, v_j) is the probability density function of the t distribution, expressed as:

t(x_n | μ_j, Λ_j, v_j) = ∫ N(x_n | μ_j, u_nj Λ_j) Gam(u_nj | v_j/2, v_j/2) du_nj

where N(·) and Gam(·) denote the Gaussian distribution function and the Gamma distribution function, respectively, and u_nj is the hidden variable associated with x_n and the j-th mixture component; the weights π_j(V) satisfy

Σ_{j=1}^{K} π_j(V) = 1

and are expressed as:

π_j(V) = V_j ∏_{i=1}^{j−1} (1 − V_i)

The variable V_j in the above formula obeys a Beta distribution, i.e. p(V_j) = Beta(V_j | 1, α), where α is the hyper-parameter of the Beta distribution; μ_j and Λ_j obey a joint Gaussian-Wishart distribution, i.e. the product N(·)W(·) of a Gaussian distribution and a Wishart distribution:

p(μ_j,Λ_j)＝N(μ_j|m_j,λ_jΛ_j)W(Λ_j|W_j,ρ_j)

where {m_j, λ_j, W_j, ρ_j} are the hyper-parameters of the joint Gaussian-Wishart distribution: m_j is a six-dimensional column vector, λ_j and ρ_j are scalars, and W_j is a (6 × 6) matrix; a hidden variable

Z = {z_n}, n = 1, ..., N^alm

is introduced, where z_n indicates which component of the t-mixture model generated the current data x_n: when x_n is generated from the j-th mixture component, z_nj = 1. Based on the above, the hyper-parameters of the entire model are:

{α, m_j, λ_j, W_j, ρ_j, v_j}, j = 1, ..., K
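The weights π_j(V) in step 2 are driven by Beta-distributed variables V_j. A minimal Python sketch of the usual stick-breaking construction, π_j = V_j ∏_{i<j} (1 − V_i), is given below. It assumes this standard construction and, as a further assumption not stated in the text, a truncation at K components with the last V set to 1 so that the weights sum to one; the function name and the values of `alpha` and `K` are hypothetical.

```python
import random


def stick_breaking_weights(V):
    """Turn stick-breaking fractions V_1..V_K into mixture weights pi_1..pi_K."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for v in V:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
alpha = 2.0  # hyper-parameter of Beta(1, alpha); illustrative value
K = 10       # truncation level; illustrative value

# V_j ~ Beta(1, alpha); the last fraction is fixed to 1 (truncation assumption)
V = [rng.betavariate(1.0, alpha) for _ in range(K - 1)] + [1.0]
pi = stick_breaking_weights(V)
```

A larger `alpha` tends to spread the weight over more components, while a small `alpha` concentrates it on the first few, which is the mechanism that lets the model adapt the effective number of components to the data.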

Step 3: estimate the parameters of the mixture model using X^alm, specifically as follows:

3-1) generate N^alm random integers that are uniformly distributed on the interval [1, K], and count the relative frequency of each integer in the interval; that is, if the integer j is generated N_j times, then δ_j = N_j/N^alm; for each minority class sample x_n, the initial distribution of the corresponding hidden variable z_n is

z_n is a K-dimensional vector whose component z_nj in each dimension (j = 1, ..., K) takes a value in {0, 1};
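The initialization in step 3-1) can be sketched in a few lines of Python: draw N^alm integers uniformly from 1 to K and use the relative frequency of each integer as δ_j. The function name `init_assignment_probs` and the sizes used are hypothetical.

```python
import random
from collections import Counter


def init_assignment_probs(n_min, K, seed=0):
    """Return [delta_1, ..., delta_K], the relative frequency of each integer
    among n_min uniform draws from {1, ..., K}."""
    rng = random.Random(seed)
    draws = [rng.randint(1, K) for _ in range(n_min)]  # randint is inclusive on both ends
    counts = Counter(draws)
    return [counts.get(j, 0) / n_min for j in range(1, K + 1)]


delta = init_assignment_probs(n_min=200, K=5)
```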

3-2) set the initial values of the hyper-parameters m_j, λ_j, W_j, ρ_j, v_j and α: for all j (j = 1, ..., K), m_j = 0, λ_j = 1, ρ_j takes any value between 3 and 20, W_j = I where I is the identity matrix, v_j takes any value between 1 and 100, and α takes any value between 1 and 10; in addition, the iteration counter k is set to 1;

3-3) update the distribution of the hidden variables

that is,

whose hyper-parameters

are updated by:

where, in the first iteration, the quantity

is calculated as

3-4) update the distribution of the random variables

that is,

whose corresponding hyper-parameters

are updated by:

where

3-5) update the distribution of the random variables

that is,

whose corresponding hyper-parameters

are updated by:

3-6) update the distribution of the hidden variables

where

In the above formula, the expectations <·> of the individual terms are calculated as follows:

where Γ(·) is the standard gamma function and Γ(·)' is the derivative of the standard gamma function; in addition, the calculations of

and <u_nj> have been given in step 3-3) and step 3-4), respectively;

3-7) update the degree-of-freedom parameter

that is, solve the following equation in v_j:

Newton's method is used to obtain the solution v_j of the equation;
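As an illustration of step 3-7), the sketch below applies Newton's method to a scalar equation in the degree of freedom v. The equation used here is an assumption: it has the form that commonly appears for Student-t mixtures, 1 + log(v/2) − ψ(v/2) + c = 0, where ψ = Γ'/Γ is the digamma function and c stands in for a data-dependent term; it is not claimed to be the patent's exact equation. The helper names, the value of `c`, the starting point and the finite-difference derivative are all hypothetical choices.

```python
import math


def digamma(x):
    """psi(x) = Gamma'(x) / Gamma(x) for x > 0, via recurrence plus asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift the argument up: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))


def dof_equation(v, c):
    """Assumed left-hand side of the degree-of-freedom equation."""
    return 1.0 + math.log(v / 2.0) - digamma(v / 2.0) + c


def newton_dof(c, v0=1.0, tol=1e-10, max_iter=100):
    """Solve dof_equation(v, c) = 0 for v > 0 with Newton's method."""
    v = v0
    for _ in range(max_iter):
        f = dof_equation(v, c)
        if abs(f) < tol:
            break
        h = 1e-6 * v  # central finite difference for the derivative
        slope = (dof_equation(v + h, c) - dof_equation(v - h, c)) / (2.0 * h)
        v_new = v - f / slope
        v = v_new if v_new > 0 else v / 2.0  # keep the iterate positive
    return v


c = -1.2  # hypothetical data-dependent constant; a root exists only for c < -1
v_hat = newton_dof(c)
```

Starting from a small positive value is a deliberate choice: the assumed function is decreasing and convex in v, so Newton iterates that start to the left of the root approach it monotonically without overshooting.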

3-8) calculate the likelihood value LIK_itr after the current iteration, where itr is the current iteration number:

3-9) calculate the difference ΔLIK = LIK_itr − LIK_{itr−1} between the likelihood value after the current iteration and the likelihood value after the previous iteration; if ΔLIK ≤ δ, the parameter estimation process ends; otherwise, go to step 3-3), increase itr by 1, and continue with the next iteration; the threshold δ takes a value in the range 10^-5 to 10^-4;
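The stopping rule of steps 3-8) and 3-9) can be sketched as a small driver loop in Python. The `update` callable is a hypothetical stand-in for one pass through steps 3-3) to 3-8) that returns the likelihood value; the toy likelihood sequence and the `max_iter` safeguard are assumptions added for illustration only.

```python
import math


def iterate_until_converged(update, threshold=1e-5, max_iter=1000):
    """Repeat update(itr) until LIK_itr - LIK_{itr-1} <= threshold."""
    previous = None
    lik = None
    for itr in range(1, max_iter + 1):
        lik = update(itr)  # one full pass of the update steps
        if previous is not None and lik - previous <= threshold:
            return itr, lik
        previous = lik
    return max_iter, lik  # safeguard: stop even if the rule was never met


def toy_update(itr):
    """Hypothetical likelihood sequence that increases towards 0."""
    return -math.exp(-itr)


n_iter, final_lik = iterate_until_converged(toy_update)
```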

Step 4: generate a new network data set (X^alm)' using the estimated Bayesian mixture model; with N' denoting the amount of data to be generated, this comprises:

4-1) randomly generate a random number ε that is uniformly distributed between 0 and 1;

4-2) randomly generate a variable obeying the following distribution:

4-3) calculate

4-4) randomly generate a variable obeying the following distribution:

4-5) using the estimated weights π_j (j = 1, ..., K): if ε ∈ [0, π_1], a sample obeying the t distribution t(μ_1, Λ_1, v_1) is generated; if ε ∈ (π_1 + ... + π_{k−1}, π_1 + ... + π_k], a sample obeying the t distribution t(μ_k, Λ_k, v_k) is generated; if ε ∈ (π_1 + ... + π_{K−1}, 1], a sample obeying the t distribution t(μ_K, Λ_K, v_K) is generated;

4-6) repeat steps 4-1) to 4-5) N' times to obtain (X^alm)'; the final network fault data set is

X^alm ∪ (X^alm)'

and the total data set after synthesis is

X^maj ∪ X^alm ∪ (X^alm)'
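The sampling procedure of step 4 can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: a component is selected by comparing ε with the cumulative weights; a Student-t sample is drawn through the usual Gamma-Gaussian construction (u from Gam(v/2, v/2), then x from a Gaussian whose precision is scaled by u); Λ is treated as a precision matrix, as the Wishart prior suggests; and Λ is taken to be diagonal to keep the code free of matrix algebra. All function names and parameter values are hypothetical.

```python
import math
import random


def pick_component(eps, weights):
    """Return the 0-based index whose cumulative-weight interval contains eps."""
    cumulative = 0.0
    for k, w in enumerate(weights):
        cumulative += w
        if eps <= cumulative:
            return k
    return len(weights) - 1  # guard against rounding when eps is close to 1


def sample_student_t(mu, precision_diag, v, rng):
    """Draw one sample from t(mu, Lambda, v) with diagonal Lambda (assumption)."""
    # gammavariate takes (shape, scale); rate v/2 corresponds to scale 2/v
    u = rng.gammavariate(v / 2.0, 2.0 / v)
    # x | u ~ N(mu, (u * Lambda)^-1), one coordinate at a time
    return [rng.gauss(m, 1.0 / math.sqrt(u * lam))
            for m, lam in zip(mu, precision_diag)]


def synthesize(weights, mus, precisions, dofs, n_new, seed=0):
    """Generate n_new synthetic minority samples from the fitted mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_new):
        eps = rng.random()  # uniform random number, as in step 4-1)
        k = pick_component(eps, weights)
        samples.append(sample_student_t(mus[k], precisions[k], dofs[k], rng))
    return samples


# Hypothetical fitted parameters for a 3-component, 2-dimensional model
weights = [0.2, 0.5, 0.3]
mus = [[0.0, 0.0], [5.0, 5.0], [-3.0, 4.0]]
precisions = [[1.0, 1.0], [4.0, 2.0], [0.5, 0.5]]
dofs = [5.0, 10.0, 3.0]

synthetic = synthesize(weights, mus, precisions, dofs, n_new=4)
```

A full implementation for the six-attribute data would use a 6 × 6 precision matrix and a Cholesky factor instead of the per-coordinate standard deviation.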

The invention has the following beneficial effects:

1. By generating network fault data, the invention addresses the problem that classification and prediction on unbalanced data are not accurate enough in the network fault prediction task.

2. The invention uses a Bayesian mixture model to model the distribution of the network fault data and thereby captures the characteristics of the data; compared with conventional methods, the new network fault data generated by the invention are more representative and more discriminative for classification.

3. The Bayesian mixture model designed by the invention can adaptively determine the optimal model structure from the minority class data.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a distribution diagram of an artificially generated sample after fitting with a Bayesian mixture model in accordance with the present invention.

FIG. 3 is a likelihood value variation curve of the Bayesian mixture model iterative process of the present invention.

FIG. 4 is a comparison of G values for the Kmeans-SMOTE method, the GMM oversampling method and the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

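As shown in FIG. 1, the invention provides a network fault data synthesis method based on a Bayesian mixture model, which comprises the following steps:

Step 1: denote the collected network data set (N samples in total) as

X = {x_n}, n = 1, ..., N

where each x_n comprises six attributes: packet loss rate, terminal download rate, transmission delay, jitter, video transmission quality and end-user experience score; the data set corresponds to a set of labels

Y = {y_n}, n = 1, ..., N

with y_n = 0 or 1, i.e. X corresponds to two classes of labels, where y_n = 0 is the network-normal class label and y_n = 1 is the network-fault class label. Because the network-normal class contains far more data than the network-fault class, the set formed by the x_n with y_n = 1 is defined as the minority class

X^alm = {x_n^alm}, n = 1, ..., N^alm

where x_n^alm is a minority class sample and N^alm is the number of minority class samples, and the set formed by the x_n with y_n = 0 is the majority class

X^maj = {x_n^maj}, n = 1, ..., N^maj

where x_n^maj is a majority class sample and N^maj is the number of majority class samples;

Step 2: a Bayesian mixture model is selected to represent the distribution of X^alm; its probability density function is:

p(x_n) = Σ_{j=1}^{K} π_j(V) t(x_n | μ_j, Λ_j, v_j)

where K is the number of mixture components, and π_j(V), μ_j, Λ_j and v_j denote the weight, the mean, the covariance matrix and the degree-of-freedom parameter of the j-th mixture component, respectively. t(x_n | μ_j, Λ_j, v_j) is the probability density function of the t distribution, which can be expressed as:

t(x_n | μ_j, Λ_j, v_j) = ∫ N(x_n | μ_j, u_nj Λ_j) Gam(u_nj | v_j/2, v_j/2) du_nj

where N(·) and Gam(·) denote the Gaussian distribution function and the Gamma distribution function, respectively, and u_nj is the hidden variable associated with x_n and the j-th mixture component. The weights π_j(V) satisfy

Σ_{j=1}^{K} π_j(V) = 1

and are expressed as:

π_j(V) = V_j ∏_{i=1}^{j−1} (1 − V_i)

The variable V_j in the above formula obeys a Beta distribution, i.e. p(V_j) = Beta(V_j | 1, α), where α is the hyper-parameter of the Beta distribution. In addition, μ_j and Λ_j obey a joint Gaussian-Wishart distribution (i.e. the product N(·)W(·) of a Gaussian distribution and a Wishart distribution):

p(μ_j,Λ_j)＝N(μ_j|m_j,λ_jΛ_j)W(Λ_j|W_j,ρ_j)

where {m_j, λ_j, W_j, ρ_j} are the hyper-parameters of the joint Gaussian-Wishart distribution. m_j is a six-dimensional column vector, λ_j and ρ_j are scalars, and W_j is a (6 × 6) matrix. It is also necessary to introduce a hidden variable

Z = {z_n}, n = 1, ..., N^alm

where z_n indicates which component of the t-mixture model generated the current data x_n. When x_n is generated from the j-th mixture component, z_nj = 1. Based on the above, the hyper-parameters of the whole model are:

{α, m_j, λ_j, W_j, ρ_j, v_j}, j = 1, ..., K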

Step 3: estimate the parameters of the mixture model using X^alm, specifically as follows:

(3-1) generate N^alm random integers that are uniformly distributed on the interval [1, K], and count the relative frequency of each integer in the interval; that is, if the integer j is generated N_j times, then δ_j = N_j/N^alm; for each minority class sample x_n, the initial distribution of the corresponding hidden variable z_n is:

In addition, z_n is a K-dimensional vector whose component z_nj in each dimension (j = 1, ..., K) takes a value in {0, 1};

(3-2) set the initial values of the hyper-parameters m_j, λ_j, W_j, ρ_j, v_j and α: for all j (j = 1, ..., K), m_j = 0, λ_j = 1, ρ_j can be any value between 3 and 20, W_j = I where I is the identity matrix, v_j can be any value between 1 and 100, and α can be any value between 1 and 10; in addition, the iteration counter k is set to 1;

(3-3) update the distribution of the hidden variables

that is,

whose hyper-parameters

are updated by:

where, in the first iteration, the quantity

is calculated as

(3-4) update the distribution of the random variables

that is,

whose corresponding hyper-parameters

are updated by:

where

(3-5) update the distribution of the random variables

that is,

whose corresponding hyper-parameters

are updated by:

(3-6) update the distribution of the hidden variables

where:

where Γ(·) is the standard gamma function and Γ(·)' is the derivative of the standard gamma function; in addition, the calculations of

and <u_nj> have been given in step (3-3) and step (3-4), respectively;

(3-7) update the degree-of-freedom parameter

that is, solve the following equation in v_j:

the solution v_j of the equation can be obtained quickly with a common numerical method such as Newton's method;

(3-8) calculate the likelihood value LIK_itr after the current iteration, where itr is the current iteration number:

(3-9) calculate the difference ΔLIK = LIK_itr − LIK_{itr−1} between the likelihood value after the current iteration and the likelihood value after the previous iteration; if ΔLIK ≤ δ, the parameter estimation process ends; otherwise, go to step (3-3), increase itr by 1, and continue with the next iteration; the threshold δ takes a value in the range 10^-5 to 10^-4.

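Step 4: generate a new network data set (X^alm)' using the estimated Bayesian mixture model; with N' denoting the amount of data to be generated, this comprises:

(4-1) randomly generate a random number ε that is uniformly distributed between 0 and 1;

(4-2) randomly generate a variable obeying the following distribution:

(4-3) calculate

(4-4) randomly generate a variable obeying the following distribution:

(4-5) using the estimated weights π_j (j = 1, ..., K): if ε ∈ [0, π_1], a sample obeying the t distribution t(μ_1, Λ_1, v_1) is generated; if ε ∈ (π_1 + ... + π_{k−1}, π_1 + ... + π_k], a sample obeying the t distribution t(μ_k, Λ_k, v_k) is generated; if ε ∈ (π_1 + ... + π_{K−1}, 1], a sample obeying the t distribution t(μ_K, Λ_K, v_K) is generated;

(4-6) repeat steps (4-1) to (4-5) N' times to obtain (X^alm)'; the final network fault data set is

X^alm ∪ (X^alm)'

and the total data set after synthesis is

X^maj ∪ X^alm ∪ (X^alm)'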

Performance comparison:

The clustering performance of the Bayesian mixture model (DPMM) was tested first. The idea is as follows: the DPMM clustering algorithm performs unsupervised learning on samples drawn from several clusters whose cluster labels are unknown, and the clustering result is then compared with the labels of the original samples to show the classification performance.

In the experiment, 1000 three-dimensional samples were generated from three single Gaussian models, and the algorithm was run for 200 iterations. The distribution of the sample points after fitting with the Bayesian mixture model designed by the invention is shown in FIG. 2. 942 samples were classified correctly, so the fitting accuracy reached 94.2%. FIG. 3 shows a line graph of the change in the number of classes over the 200 iterations: the number of mixture components K of the model generally fluctuates around 3 and finally converges after approximately 160 iterations. The experimental results show that the Bayesian mixture model can determine the model structure automatically from the given samples.

The method of the invention was then verified on network data provided by a network operator. New samples are synthesized with the method and added to the minority class so that the new data set is relatively balanced; a naive Bayes classifier is then used as the base classifier to train a model on the new data set, and the model is evaluated on a test data set. The traditional Kmeans-SMOTE method and the GMM oversampling method were used for comparison. The test data set used raw data, and ratios of minority class to majority class of 1:30, 1:60 and 1:89 were chosen for training and testing. The results are shown in FIG. 4: compared with the Kmeans-SMOTE algorithm and the GMM oversampling method, the G value of the method of the invention is improved by 16% and 4.8%, respectively. The method of the invention therefore effectively improves classification and prediction on unbalanced network data.
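The text does not define the G value used in FIG. 4. Assuming it refers to the G-mean that is commonly reported for unbalanced classification, i.e. the geometric mean of the recall on the fault class and the recall on the normal class, it could be computed as in the following Python sketch. The function name and the toy labels are hypothetical.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of minority-class recall and majority-class recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on faults
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on normal data
    return math.sqrt(sensitivity * specificity)


# Toy labels: 1 = network fault (minority), 0 = network normal (majority)
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)

# A classifier that always predicts "normal" gets a G-mean of zero
always_normal = g_mean(y_true, [0] * len(y_true))
```

Unlike plain accuracy, this measure drops to zero when the minority class is never predicted, which is why it suits comparisons on unbalanced data.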

Claims

1. A network fault data synthesis method based on a Bayesian hybrid model is characterized by comprising the following steps:

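Step 1: denote the collected network data set (N samples in total) as

X = {x_n}, n = 1, ..., N

where each x_n comprises six attributes: packet loss rate, terminal download rate, transmission delay, jitter, video transmission quality and end-user experience score; the data set corresponds to a set of labels

Y = {y_n}, n = 1, ..., N

with y_n = 0 or 1, i.e. X corresponds to two classes of labels, where y_n = 0 is the network-normal class label and y_n = 1 is the network-fault class label; because the network-normal class contains far more data than the network-fault class, the set formed by the x_n with y_n = 1 is defined as the minority class

X^alm = {x_n^alm}, n = 1, ..., N^alm

where x_n^alm is a minority class sample and N^alm is the number of minority class samples, and the set formed by the x_n with y_n = 0 is the majority class

X^maj = {x_n^maj}, n = 1, ..., N^maj

where x_n^maj is a majority class sample and N^maj is the number of majority class samples;

Step 2: a Bayesian mixture model is selected to represent the distribution of X^alm; its probability density function is:

p(x_n) = Σ_{j=1}^{K} π_j(V) t(x_n | μ_j, Λ_j, v_j)

where K is the number of mixture components, and π_j(V), μ_j, Λ_j and v_j denote the weight, the mean, the covariance matrix and the degree-of-freedom parameter of the j-th mixture component, respectively; t(x_n | μ_j, Λ_j, v_j) is the probability density function of the t distribution, expressed as:

t(x_n | μ_j, Λ_j, v_j) = ∫ N(x_n | μ_j, u_nj Λ_j) Gam(u_nj | v_j/2, v_j/2) du_nj

where N(·) and Gam(·) denote the Gaussian distribution function and the Gamma distribution function, respectively, and u_nj is the hidden variable associated with x_n and the j-th mixture component; the weights π_j(V) satisfy

Σ_{j=1}^{K} π_j(V) = 1

and are expressed as:

π_j(V) = V_j ∏_{i=1}^{j−1} (1 − V_i)

The variable V_j in the above formula obeys a Beta distribution, i.e. p(V_j) = Beta(V_j | 1, α), where α is the hyper-parameter of the Beta distribution; μ_j and Λ_j obey a joint Gaussian-Wishart distribution, i.e. the product N(·)W(·) of a Gaussian distribution and a Wishart distribution:

p(μ_j,Λ_j)＝N(μ_j|m_j,λ_jΛ_j)W(Λ_j|W_j,ρ_j)

where {m_j, λ_j, W_j, ρ_j} are the hyper-parameters of the joint Gaussian-Wishart distribution: m_j is a six-dimensional column vector, λ_j and ρ_j are scalars, and W_j is a (6 × 6) matrix; a hidden variable

Z = {z_n}, n = 1, ..., N^alm

is introduced, where z_n indicates which component of the t-mixture model generated the current data x_n: when x_n is generated from the j-th mixture component, z_nj = 1; based on the above, the hyper-parameters of the entire model are:

{α, m_j, λ_j, W_j, ρ_j, v_j}, j = 1, ..., K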

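Step 3: estimate the parameters of the mixture model using X^alm, specifically as follows:

3-1) generate N^alm random integers that are uniformly distributed on the interval [1, K], and count the relative frequency of each integer in the interval; that is, if the integer j is generated N_j times, then δ_j = N_j/N^alm; for each minority class sample x_n, the initial distribution of the corresponding hidden variable z_n is

3-2) set the initial values of the hyper-parameters m_j, λ_j, W_j, ρ_j, v_j and α: for all j, j = 1, ..., K, m_j = 0, λ_j = 1, ρ_j takes any value between 3 and 20, W_j = I where I is the identity matrix, v_j takes any value between 1 and 100, and α takes any value between 1 and 10; in addition, the iteration counter k is set to 1;

3-3) update the distribution of the hidden variables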