CN115442107A - Communication data anomaly detection method based on Gaussian mixture model - Google Patents

Communication data anomaly detection method based on Gaussian mixture model

Info

Publication number
CN115442107A
CN115442107A
Authority
CN
China
Prior art keywords: data, mixture model, gaussian mixture, anomaly detection, communication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211054379.5A
Other languages
Chinese (zh)
Inventor
刘杨
朱静宇
孙云霄
魏玉良
王孝朋
王佰玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weihai Tianzhiwei Network Space Safety Technology Co ltd, Harbin Institute of Technology Weihai filed Critical Weihai Tianzhiwei Network Space Safety Technology Co ltd
Priority to CN202211054379.5A
Publication of CN115442107A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a communication data anomaly detection method based on a Gaussian mixture model, which solves the technical problems that existing statistics-based anomaly detection methods give unsatisfactory prediction results and have high computational complexity. The method comprises the following steps. Inputting a data set: a network communication behavior data set is input, where the data set is the set of time costs of a plurality of communications at each stage. Determining the hidden variable: the data of each stage come from a Gaussian mixture model; the hidden variable is set as the link number Z with value range [1, K], and the number of Gaussian distributions forming each Gaussian mixture model is equal to the number of links K. Parameter solving: the parameters of the Gaussian mixture model determined by the hidden variable are solved iteratively with the EM (Expectation-Maximization) algorithm. Anomaly detection: when a new communication behavior occurs, the probability that its data points come from the Gaussian mixture model is calculated and whether an abnormal attack exists is predicted. The method is widely applicable in the technical field of communication data anomaly detection.

Description

Communication data anomaly detection method based on Gaussian mixture model
Technical Field
The application relates to the technical field of communication data anomaly detection, in particular to a communication data anomaly detection method based on a Gaussian mixture model.
Background
Anomaly detection refers to the problem of finding data that do not conform to expected behavior. It has been an active area of research for decades, with early exploration going back to the 1960s, and interest continues to grow because of ever-increasing demand and applications in a wide range of fields such as risk management, compliance, security, financial monitoring, health and medical risk, and AI safety. Most anomaly detection techniques can be classified as classification-based, nearest-neighbor-based, clustering-based, statistics-based, or deep-learning-based.
Classification-based methods learn a classifier from training data and then use it to classify test samples. Classification-based anomaly detection operates in a two-stage manner: the training stage learns a classifier from labeled training data, and the testing stage uses the classifier to label test samples as normal or abnormal. Based on the labels available in the training phase, classification-based anomaly detection techniques can be divided into two main categories: multi-class and one-class anomaly detection. Multi-class anomaly detection assumes that the training data contain labeled samples belonging to multiple normal classes; if no classifier labels a test sample as normal, it is considered abnormal. One-class anomaly detection assumes that all training samples carry a single class label and uses a one-class classification algorithm to learn a boundary around the normal samples. Classifiers used for anomaly detection include neural networks, Bayesian networks, support vector machines, and rule-based methods. Classification-based techniques rely on the accuracy of the labels.
Nearest-neighbor-based anomaly detection techniques require a distance or similarity metric defined between two data samples. The distance or similarity can be computed in different ways, and the techniques fall broadly into two categories: techniques that use the distance between a data sample and its k nearest neighbors as the anomaly score, and techniques that compute the relative density around each data sample as its anomaly score. A key advantage of nearest-neighbor-based techniques is that they are unsupervised in nature and make no assumption about the generating distribution of the data. The computational complexity of the testing phase is a significant challenge, since computing the nearest neighbors involves computing the distance between each test sample and all samples in the test data or training data. In addition, when the data are complex, defining a distance metric between samples may itself be challenging.
Clustering-based anomaly detection techniques rest mainly on three assumptions: (1) normal data samples belong to a cluster in the data, while anomalies do not belong to any cluster; (2) normal data samples lie close to the centroid of their closest cluster, while anomalies lie far from it; (3) normal data samples belong to large, dense clusters, while anomalies belong to small or sparse clusters. Several clustering-based techniques require distance calculations between pairs of sample points; in this respect they resemble nearest-neighbor-based techniques, and the choice of distance metric is crucial to their performance. A key difference, however, is that clustering-based techniques evaluate each sample with respect to the cluster it belongs to, while nearest-neighbor-based techniques evaluate each sample with respect to its local neighborhood. Clustering-based methods are also unsupervised, but their performance depends heavily on how well the clustering algorithm captures the structure of the sample set, and several such techniques are effective only when the anomalies do not themselves form significant clusters. Their computational complexity is likewise often a difficulty.
Anomaly detection techniques based on statistical methods rest on the assumption that normal data samples are generated in the high-probability regions of a stochastic model, while abnormal samples are generated in its low-probability regions. These methods can be divided into those based on signal processing techniques, those based on principal component analysis, and those based on mixture models. In signal-processing-based techniques, abnormal behavior is identified from sudden changes in the statistics, detected with a hypothesis test based on the Generalized Likelihood Ratio (GLR) that yields a degree of abnormality between 0 and 1. Methods based on principal component analysis perform singular value decomposition of the original data matrix and set a threshold as needed to retain the main features of the data; they make no assumption about the statistical distribution, can reduce the dimensionality of the data without losing important information, and can reduce computational complexity. In the mixture-model approach, taking the Gaussian mixture model as an example, the technique assumes that the data are generated from Gaussian distributions. The parameters are estimated by Maximum Likelihood Estimation (MLE), the distance of a data sample from the estimated mean is taken as its anomaly score, and a threshold on the anomaly score determines anomalies. Different techniques in this category compute the distance to the mean and the threshold in different ways. A simple outlier detection technique declares anomalous all data samples that lie more than 3σ from the mean μ of the distribution, where σ is the standard deviation and the region μ ± 3σ contains 99.7% of the data samples. However, this kind of method depends heavily on the model assumptions: when the assumed model matches the real distribution of the data, high accuracy can be obtained, but when it does not, the prediction performance of the model is poor, so the data source must first be analyzed to construct a model that matches its regularities. In addition, the approach has high computational complexity; it suits small-sample data sets but not scenarios with very large amounts of data, since oversized sample sets make the model parameters difficult to solve.
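For illustration, the simple 3σ rule described above can be expressed as a short sketch; this is a generic Python/NumPy example with made-up data, assumed here only to make the preceding discussion concrete, and is not part of the present application.

```python
import numpy as np

def three_sigma_outliers(samples):
    """Flag samples lying more than 3 standard deviations from the sample mean.

    Returns a boolean mask (True = suspected outlier). Works only when the
    single feature is roughly Gaussian -- the model-assumption limitation
    discussed above.
    """
    samples = np.asarray(samples, dtype=float)
    mu, sigma = samples.mean(), samples.std()
    return np.abs(samples - mu) > 3.0 * sigma

# Example: one injected spike among normally distributed timings.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10.0, scale=1.0, size=500), 25.0)
print(np.where(three_sigma_outliers(data))[0])   # -> index of the injected spike
```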
Disclosure of Invention
In order to solve the above technical problems, the technical solution adopted by the application is as follows. The communication data anomaly detection method based on the Gaussian mixture model comprises the following steps:
inputting a data set: inputting a network communication behavior data set, wherein the data set is a time cost set of a plurality of communications at each stage;
determining the hidden variable: the data of each stage come from a Gaussian mixture model; the hidden variable is set as the link number Z with value range [1, K]; the number of Gaussian distributions forming each Gaussian mixture model is equal to the number of links K;
parameter solving: iterative solution is carried out through the EM (Expectation-Maximization) algorithm, and the parameters of the Gaussian mixture model determined by the hidden variable are solved;
abnormality detection: when a new communication behavior occurs, the probability that the data points of the communication behavior come from the Gaussian mixture model is calculated, and whether an abnormal attack exists or not is predicted.
Preferably, in the data set input step, the data set is further preprocessed and divided into a training set and a test set, wherein the training set contains only data that have not been attacked, and the test set contains both un-attacked data and attacked data.
Preferably, the formulas for the parameter solution are as follows:

μ_k = ( Σ_{j=1}^{N} ẑ_jk · y_j ) / ( Σ_{j=1}^{N} ẑ_jk )

σ_k² = ( Σ_{j=1}^{N} ẑ_jk · (y_j - μ_k)² ) / ( Σ_{j=1}^{N} ẑ_jk )

α_k = ( Σ_{j=1}^{N} ẑ_jk ) / N

wherein Y is the observed variable, Z is the hidden variable (the link number) with value range [1, K], K is the number of links, the subscript k denotes the k-th Gaussian distribution, ẑ_jk is the expected indicator, computed at each iteration, that data point j communicates over link k, and N is the number of training samples; μ, σ and α are respectively the mean, standard deviation and weight coefficient of the Gaussian distributions;
the training set is substituted for Y and the iteration is carried out, yielding the mean matrix μ, the standard deviation σ and the weight coefficient α of each Gaussian mixture model.
Preferably, in the anomaly detection, a threshold is set, and if the probability is smaller than the threshold, the probability that the data point comes from the gaussian mixture model is considered to be too small, and the communication behavior is judged to be abnormal and possibly attacked; otherwise, the communication behavior is judged to be normal and not attacked.
Preferably, the value of K is set according to empirical values; alternatively, a loop over K from 1 to 100 is run, the prediction accuracy under each K value is recorded on the test set, and the K value with the highest accuracy is selected.
Preferably, the preprocessing further comprises merging different phase data columns of the same encryption algorithm in the data set.
Preferably, in the anomaly detection, the probability calculation specifically includes normalizing the probability density function value of each distribution of the new data point in the gaussian mixture model, and taking the maximum value as the probability.
On the basis of the Gaussian distribution assumption, the method searches for the value of the link number K whose distribution is closest to that of the real data set, obtaining a model closest to the real data distribution and therefore a higher prediction accuracy. Experimental results show that, compared with classification- and clustering-based anomaly detection methods, the method achieves higher prediction accuracy; compared with deep-learning-based anomaly detection methods, it uses fewer training resources. On the basis of anomaly detection with a Gaussian mixture model, it does not predict probabilities directly from the mean and standard deviation; instead, the probability density value of each component of the mixture model at the data point is normalized and the maximum is taken as the probability. The required training data set is therefore smaller and the training time shorter while comparable accuracy is achieved, and the performance on small-sample data sets is clearly better. At the same time, the calculation of the generation probability of new data in the anomaly detection stage is simple and of low computational complexity, so the method scales well with growth in data set size and also shows a good prediction effect on large-sample data sets.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flow chart of a communication data anomaly detection method based on a Gaussian mixture model according to the present invention;
fig. 2 is a partial communication traffic data list.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
The communication data anomaly detection method based on the gaussian mixture model provided by the embodiment of the present application is now described.
Please refer to fig. 1, which is a flowchart of the communication data anomaly detection method based on a Gaussian mixture model according to an embodiment of the present application. The application provides a communication data anomaly detection method based on a Gaussian mixture model for detecting abnormal attacks in traffic data. The data set used is the set of time costs of a plurality of communications at each stage; the time cost of each stage is assumed to come from a Gaussian mixture model, and the link number indicates which Gaussian distribution a data point comes from, i.e. the number of Gaussian distributions forming each Gaussian mixture model equals the number of links. The number of Gaussian mixture models is determined, the hidden variable is set as the link number, and the parameters of the Gaussian mixture models determined by the hidden variable are solved with the EM (Expectation-Maximization) algorithm. When a new communication behavior occurs, whether an abnormal attack exists is predicted by calculating the probability that the data points of that behavior come from the Gaussian mixture models. For ease of understanding, the method of this embodiment is further described below:
1. Data set partitioning
The data set of the method comes from real network communication behavior; the portion shown in fig. 2 is data that has not been attacked. There are nearly 4182 records in total, each with 6 columns, and each column represents the time consumed by a different communication phase, denoted communication phase 0, communication phase 1, communication phase 2, communication phase 3, communication phase 4 and communication phase 5. The raw data of the data set are denoted C = (c1, c2, c3, c4, c5, c6), where c1, c2, c3, c4, c5, c6 represent the durations of communication phases 0, 1, 2, 3, 4 and 5 respectively. 70% of the data is used as the training set to construct the model and 30% as the positive samples of the test set; abnormal data amounting to 20% of the data set size are randomly generated as the negative samples of the test set, and in addition there are nearly 1000 real attacked communication records. The label attacked = 0 is appended to each positive sample and attacked = 1 to each negative sample to form the test set.
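A minimal sketch of this partitioning step is given below (Python/NumPy), assuming the 6-column duration records are already loaded as an array; the 70/30 split mechanics, the synthetic-anomaly generator and all function names are illustrative assumptions rather than the exact procedure of the application.

```python
import numpy as np

def split_dataset(C, rng=None):
    """Split raw traffic durations C (n rows x 6 columns) into a benign
    training set and a labelled test set (attacked = 0 for benign samples,
    attacked = 1 for synthetic anomalies)."""
    rng = rng or np.random.default_rng(0)
    C = np.asarray(C, dtype=float)
    n = len(C)
    idx = rng.permutation(n)
    n_train = int(0.7 * n)
    train = C[idx[:n_train]]            # 70% of benign records -> model fitting
    test_pos = C[idx[n_train:]]         # remaining 30% -> positive test samples
    # Synthetic negatives, 20% of the data-set size, drawn well outside the
    # observed per-column range (an assumed generation scheme, for illustration).
    n_neg = int(0.2 * n)
    test_neg = rng.uniform(C.max(axis=0) * 2.0, C.max(axis=0) * 10.0,
                           size=(n_neg, C.shape[1]))
    test_X = np.vstack([test_pos, test_neg])
    test_y = np.concatenate([np.zeros(len(test_pos)), np.ones(n_neg)])
    return train, test_X, test_y
```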
2. Model assumptions
A Gaussian mixture model is assumed for each column of data. Since the data in different columns represent different communication behaviors, a separate Gaussian mixture model should in principle be assumed and solved for each column. In this data set, however, communication phase 2, communication phase 3 and communication phase 4 use the same encryption algorithm, so for simplicity of calculation the three phases share one Gaussian mixture model; the three columns are merged into one column by taking their mean. The raw data C = (c1, c2, c3, c4, c5, c6) are thus mapped to a model input of 4182 rows with 4 columns per row, denoted T = (t1, t2, t3, t4), T ∈ R^(4182×4), where t1 = c1, t2 = c2, t3 = (c3 + c4 + c5)/3 and t4 = c6.
Suppose the data t1, t2, t3, t4 are generated from Gaussian mixture models gmm1, gmm2, gmm3 and gmm4, respectively. When a Gaussian mixture model is used, its hidden variable needs to be determined; considering the characteristics of communication behavior, the hidden variable is set as the link number.
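The preprocessing described above (merging the three phases that share an encryption algorithm and assembling T) can be sketched as follows in Python/NumPy; the function name and array layout are assumptions made for illustration only.

```python
import numpy as np

def preprocess(C):
    """Map raw durations C = (c1..c6) to the model input T = (t1..t4),
    averaging the three phases that share the same encryption algorithm."""
    C = np.asarray(C, dtype=float)
    t1, t2 = C[:, 0], C[:, 1]
    t3 = C[:, 2:5].mean(axis=1)      # mean of communication phases 2, 3 and 4
    t4 = C[:, 5]
    return np.column_stack([t1, t2, t3, t4])   # shape (n, 4)
```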
3. Parameter solving
The processed T = (t1, t2, t3, t4) is taken as input, and the EM algorithm is used for iterative solution to obtain the parameters of each Gaussian mixture model: the coefficients, means and standard deviations. The specific calculation process is as follows:
gaussian Mixture Model (GMM) assumes that the data originates from different gaussian distributions that are combined into one gaussian mixture model. The mathematical representation is a model with a probability distribution as shown in equation (1):
P(T | θ) = Σ_{i=1}^{K} α_i · φ(T | θ_i)    (1)

where α_i is the coefficient of the i-th Gaussian distribution in the mixture, α_i ≥ 0 and Σ_{i=1}^{K} α_i = 1, and φ(T | θ_i) is the probability density function of the i-th Gaussian distribution with parameters θ_i = (μ_i, σ_i²). The k-th Gaussian density is shown in equation (2):

φ(T | θ_k) = 1 / (√(2π) · σ_k) · exp( -(T - μ_k)² / (2σ_k²) )    (2)
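For illustration, equations (1) and (2) can be evaluated directly; the sketch below uses SciPy's univariate normal density with illustrative parameter values (K = 2) that are assumptions, not values taken from the application.

```python
import numpy as np
from scipy.stats import norm

def gmm_pdf(x, alpha, mu, sigma):
    """Mixture density of equation (1): sum_k alpha_k * phi(x | mu_k, sigma_k),
    with phi the univariate Gaussian density of equation (2)."""
    x = np.atleast_1d(x)[:, None]                       # shape (n, 1)
    return (alpha * norm.pdf(x, loc=mu, scale=sigma)).sum(axis=1)

# Illustrative two-component mixture (assumed values).
alpha = np.array([0.6, 0.4])
mu    = np.array([1.0, 4.0])
sigma = np.array([0.5, 1.0])
print(gmm_pdf([0.8, 3.9], alpha, mu, sigma))
```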
the EM Algorithm, called Expectation knowledge Algorithm, is an iterative Algorithm used for maximum likelihood estimation or maximum a posteriori probability estimation of a probability parameter model containing hidden variables. The core idea of the EM algorithm is very simple and comprises two steps: extract-Step and Maximization-Step. E-Step estimates parameters mainly by observing data and an existing model, and then calculates expected values of a likelihood function Q by using the estimated parameter values; and M-Step is the corresponding parameter when the likelihood function Q is found to be maximum. The function eventually converges, since the algorithm will guarantee that the likelihood function increases after each iteration.
Denote the observed variable by Y and the hidden variable, the link number, by Z with value range [1, K], where K is the number of links; z_jk then indicates that data point j communicates over the k-th link. The value of K needs to be set according to experience, usually between 1 and 50; to achieve the best effect, a loop over K is run in practice, the prediction accuracy under each K value is recorded, and the K value with the highest accuracy is selected. The solving process of the EM algorithm can be summarized as follows:
(1) Select an initial value of the parameters θ^(0), randomly or empirically, and start the iteration;
(2) E-step: denote by θ^(i) the estimate of the parameter θ at the i-th iteration; in the E-step of the (i+1)-th iteration compute

Q(θ, θ^(i)) = E_Z[ log P(Y, Z | θ) | Y, θ^(i) ]    (3)

(3) M-step: maximize Q(θ, θ^(i)) with respect to θ to obtain the updated parameters:

θ^(i+1) = argmax_θ Q(θ, θ^(i))    (4)

(4) Repeat step (2) and step (3) until the model converges.
Substituting equation (1) into equation (3) gives the specific form of the Q function for the Gaussian mixture model:

Q(θ, θ^(i)) = Σ_{k=1}^{K} { Σ_{j=1}^{N} ẑ_jk · log α_k + Σ_{j=1}^{N} ẑ_jk · [ log(1/√(2π)) - log σ_k - (y_j - μ_k)² / (2σ_k²) ] }    (5)

where ẑ_jk = α_k φ(y_j | θ_k) / Σ_{l=1}^{K} α_l φ(y_j | θ_l) is the expected value, under the current parameters, of the indicator that data point j comes from the k-th component, and N is the number of data points.
the maximum parameter solution is performed on the form, and the parameter update value of each round can be obtained, and the specific form is shown in formulas (6), (7) and (8).
μ_k = ( Σ_{j=1}^{N} ẑ_jk · y_j ) / ( Σ_{j=1}^{N} ẑ_jk )    (6)

σ_k² = ( Σ_{j=1}^{N} ẑ_jk · (y_j - μ_k)² ) / ( Σ_{j=1}^{N} ẑ_jk )    (7)

α_k = ( Σ_{j=1}^{N} ẑ_jk ) / N    (8)
where Y is the observed variable, Z is the hidden variable (the link number) with value range [1, K], K is the number of links, the subscript k denotes the k-th Gaussian distribution, the superscript (i) denotes the i-th iteration, and ẑ_jk indicates that data point j communicates over link k; μ, σ and α are respectively the mean, standard deviation and weight coefficient of the Gaussian distributions.
T = (t1, t2, t3, t4) is substituted for Y, the value range of Z is set to [1, K], and the iteration is carried out, yielding the mean matrix μ, the standard deviation σ and the weight coefficient α of each Gaussian mixture model. For example, suppose the maximum value of Z in gmm1, i.e. the link number K, is set to 2. After the EM iteration finishes, the returned mean matrix is μ = [μ1, μ2], the standard deviation matrix is σ = [σ1, σ2] and the weight matrix is α = [α1, α2], meaning that the gmm1 model is a mixture of two Gaussian distributions, the first with coefficient α1, mean μ1 and standard deviation σ1, and the second likewise, with α1 + α2 = 1.
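A minimal univariate EM loop corresponding to the responsibilities and the update formulas (6)-(8) might look as follows; the initialisation, the stopping rule and the function name are assumptions, and fitting gmm1-gmm4 then amounts to calling it once per column of T.

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_em(y, K, n_iter=200, tol=1e-6, rng=None):
    """Fit a K-component univariate GMM to y by EM and return (alpha, mu, sigma)."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Assumed initialisation: random means drawn from the data, pooled std, equal weights.
    mu = rng.choice(y, size=K, replace=False)
    sigma = np.full(K, y.std() + 1e-6)
    alpha = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities z_hat[j, k] = P(component k | y_j).
        dens = alpha * norm.pdf(y[:, None], loc=mu, scale=sigma)              # (n, K)
        z_hat = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update formulas (6), (7) and (8).
        nk = z_hat.sum(axis=0)
        mu = (z_hat * y[:, None]).sum(axis=0) / nk                            # eq. (6)
        sigma = np.sqrt((z_hat * (y[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9  # eq. (7)
        alpha = nk / n                                                        # eq. (8)
        ll = np.log(dens.sum(axis=1)).sum()           # log-likelihood under previous params
        if abs(ll - prev_ll) < tol:                   # simple convergence check
            break
        prev_ll = ll
    return alpha, mu, sigma
```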
4. Anomaly detection
When new data arrive, whether an abnormal attack exists is predicted by calculating the probability that the new data point comes from the Gaussian mixture models.
To illustrate further, denote the new data point by q; in this embodiment it has four dimensions, i.e. q = (q1, q2, q3, q4). To calculate the probability that the new data point q comes from the Gaussian mixture models gmm1, gmm2, gmm3 and gmm4, the probabilities p1, p2, p3 and p4 that q1, q2, q3 and q4 come from gmm1, gmm2, gmm3 and gmm4 respectively must be calculated. Taking q1 as an example: the probability p1 that q1 comes from the model gmm1 is obtained by normalizing with equation (9) and then taking the maximum as the probability p1 with equation (10), as shown below:
p11 = α1·φ1(q1) / (α1·φ1(q1) + α2·φ2(q1)),  p12 = α2·φ2(q1) / (α1·φ1(q1) + α2·φ2(q1))    (9)

p1 = max(p11, p12)    (10)

where p11 is the probability that q1 comes from the first Gaussian distribution in gmm1, p12 is the probability that q1 comes from the second Gaussian distribution in gmm1, and φ1 and φ2 are the probability density functions of the two Gaussian distributions as defined in equation (2).
After p1, p2, p3 and p4 have been calculated with equation (10), and since the data in different columns are independent, the probability that the data point q comes from the Gaussian mixture models constructed from T is obtained from equation (11):
p=p1×p2×p3×p4 (11)
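A sketch of the detection step of equations (9)-(11), combined with the threshold rule described in the next paragraph, is given below; it assumes per-column parameter tuples (α, μ, σ) such as those returned by the EM sketch above, and the function names and default threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def column_probability(q_j, alpha, mu, sigma):
    """Equations (9)-(10): normalise the weighted component densities at q_j
    and return the maximum as the probability for this column."""
    dens = alpha * norm.pdf(q_j, loc=mu, scale=sigma)
    return (dens / dens.sum()).max()

def predict_attacked(q, gmm_params, c=0.9):
    """Equation (11) plus the threshold rule: multiply the per-column
    probabilities and flag the record as attacked when the product falls below c."""
    p = 1.0
    for q_j, (alpha, mu, sigma) in zip(q, gmm_params):
        p *= column_probability(q_j, alpha, mu, sigma)
    return 1 if p < c else 0          # predicted_attacked
```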
setting a threshold value c according to experience, wherein c is usually set to be in a range of 80% -90%, if p < c, the probability that a data point q comes from the model is considered to be too small, judging that the communication behavior represented by the data point is abnormal and possibly attacked, and returning to predicted _ acknowledged =1; otherwise, the data point represents the normal communication behavior and is not attacked, and the predicted _ acknowledged =0 is returned.
A verification test is carried out on the obtained model; the specific process is as follows:
setting the threshold c to be 0.9, and obtaining a training set and a test set according to the division in the step 1. Firstly, a training set is used, and the steps 2 and 3 are utilized to carry out model hypothesis and parameter solution, so that four communication stage hybrid models gmm1, gmm2, gmm3 and gmm4 are obtained. The model accuracy is then verified using the test set. Inputting test data q = (q 1, q2, q3, q 4), wherein q has labels attached =0 or attached =1, comparing whether the predicted _ attached value calculated in the step 4 is equal to the label attached value of q, if so, the prediction is successful, otherwise, the prediction is failed. Recording the number of successfully predicted data pieces as num1 and the number of unsuccessfully predicted data pieces as num2, the calculation formula (12) of the accuracy can be obtained as follows:
accuracy=num1/(num1+num2) (12)
and adjusting the link number value K from 1 to 100, and observing the prediction accuracy of different models generated by different K values, wherein the table 1 is a corresponding table of the highest accuracy rate which can be achieved by the models and the link number value K which is set to achieve the accuracy rate.
TABLE 1 model accuracy and Link count values
As can be seen from Table 1, all four models achieve an accuracy above 91%, with gmm3 even exceeding 99%.
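The verification procedure of equation (12) and the sweep over the link number K can be sketched as follows; it reuses fit_gmm_em and predict_attacked from the earlier sketches and, for brevity, applies a single K to all four columns, whereas the application tunes each model separately.

```python
import numpy as np

def evaluate(train_T, test_T, test_labels, K, c=0.9):
    """Equation (12): fit one GMM per column of train_T with link number K,
    predict every test record and return accuracy = num1 / (num1 + num2)."""
    gmm_params = [fit_gmm_em(train_T[:, j], K) for j in range(train_T.shape[1])]
    preds = np.array([predict_attacked(q, gmm_params, c) for q in test_T])
    return float((preds == test_labels).mean())

# Illustrative sweep over the link number (the data arrays are assumed to exist):
# best_K = max(range(1, 101), key=lambda K: evaluate(train_T, test_T, test_labels, K))
```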
In statistics-based anomaly detection techniques, when the distribution assumed by the model is close to the real distribution of the data, the prediction accuracy of the model is very high. This method analyzes the characteristics of the communication data set and finds that the communication durations result from the superposition of several factors, such as the computation time of the encryption algorithm and the response times of the selected link and server, which makes it very likely that the data follow a mixture model. Based on the behavior characteristics of most of the data, after selecting the Gaussian distribution the method searches for the value of the link number K whose distribution is closest to that of the real data set, obtaining a model closest to the real data distribution and therefore a higher prediction accuracy. Experimental results show that, compared with classification- and clustering-based anomaly detection methods, this method achieves higher prediction accuracy on this data set, whereas the prediction accuracy of the decision-tree method is only 83%; compared with deep-learning-based anomaly detection methods, it uses fewer training resources. On the basis of anomaly detection with a Gaussian mixture model, it does not predict probabilities directly from the mean and standard deviation; instead, the probability density value of each component of the mixture model at the data point is normalized and the maximum is taken as the probability. The required training data set is therefore smaller and the training time shorter while comparable accuracy is achieved, and the performance on small-sample data sets is clearly better. At the same time, the calculation of the generation probability of new data in the anomaly detection stage is simple and of low computational complexity, so the method adapts well to growth in data set size and also shows a good prediction effect on large-sample data sets.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (7)

1. A communication data anomaly detection method based on a Gaussian mixture model is characterized by comprising the following steps:
inputting a data set: inputting a network communication behavior data set, wherein the data set is a time cost set of a plurality of communications at each stage;
determining hidden variables: the data of each stage come from a Gaussian mixture model; the hidden variable is set as the link number Z with value range [1, K]; the number of Gaussian distributions forming each Gaussian mixture model is equal to the number of links K;
parameter solving: iterative solution is carried out through the EM (Expectation-Maximization) algorithm, and the parameters of the Gaussian mixture model determined by the hidden variables are solved;
abnormality detection: when a new communication behavior occurs, the probability that the data points of the communication behavior come from the Gaussian mixture model is calculated, and whether an abnormal attack exists or not is predicted.
2. The Gaussian mixture model-based communication data anomaly detection method according to claim 1, wherein: in the data set input step, the data set is further preprocessed and divided into a training set and a test set, wherein the training set contains data that have not been attacked, and the test set contains both un-attacked data and attacked data.
3. The Gaussian mixture model-based communication data anomaly detection method according to claim 2, wherein: the formulas for solving the parameters are as follows:

μ_k = ( Σ_{j=1}^{N} ẑ_jk · y_j ) / ( Σ_{j=1}^{N} ẑ_jk )

σ_k² = ( Σ_{j=1}^{N} ẑ_jk · (y_j - μ_k)² ) / ( Σ_{j=1}^{N} ẑ_jk )

α_k = ( Σ_{j=1}^{N} ẑ_jk ) / N

wherein Y is the observed variable, Z is the hidden variable (the link number) with value range [1, K], K is the number of links, the subscript k denotes the k-th Gaussian distribution, ẑ_jk is the expected indicator, computed at each iteration of the EM algorithm, that data point j communicates over link k, and N is the number of training samples; μ, σ and α are respectively the mean, standard deviation and weight coefficient of the Gaussian distributions;
the training set is substituted for Y and the iteration is carried out, yielding the mean matrix μ, the standard deviation σ and the weight coefficient α of each Gaussian mixture model.
4. The communication data anomaly detection method based on the Gaussian mixture model as claimed in claim 1, characterized in that: in the anomaly detection, a threshold is set; if the probability is smaller than the threshold, the probability that the data point comes from the Gaussian mixture model is considered too small, and the communication behavior is judged abnormal and possibly under attack; otherwise, the communication behavior is judged normal and not under attack.
5. The communication data anomaly detection method based on the Gaussian mixture model as claimed in claim 3, characterized in that: the K value is set according to an empirical value; alternatively, a loop over K from 1 to 100 is run, the prediction accuracy under different K values is recorded using the test set, and the K value with the highest accuracy is selected.
6. The gaussian mixture model-based communication data anomaly detection method according to claim 2, wherein: the preprocessing further comprises merging different-phase data columns of the same encryption algorithm in the data set.
7. The communication data abnormality detection method based on the gaussian mixture model according to any one of claims 1 to 6, wherein: in the anomaly detection, the probability calculation is specifically to normalize the probability density function value of each distribution of a new data point in the gaussian mixture model, and then take the maximum value as the probability.
CN202211054379.5A 2022-08-31 2022-08-31 Communication data anomaly detection method based on Gaussian mixture model Pending CN115442107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211054379.5A CN115442107A (en) 2022-08-31 2022-08-31 Communication data anomaly detection method based on Gaussian mixture model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211054379.5A CN115442107A (en) 2022-08-31 2022-08-31 Communication data anomaly detection method based on Gaussian mixture model

Publications (1)

Publication Number Publication Date
CN115442107A true CN115442107A (en) 2022-12-06

Family

ID=84243634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211054379.5A Pending CN115442107A (en) 2022-08-31 2022-08-31 Communication data anomaly detection method based on Gaussian mixture model

Country Status (1)

Country Link
CN (1) CN115442107A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190124045A1 (en) * 2017-10-24 2019-04-25 Nec Laboratories America, Inc. Density estimation network for unsupervised anomaly detection
CN110147519A (en) * 2017-09-06 2019-08-20 广东石油化工学院 A kind of data processing method and device
CN113327008A (en) * 2021-04-22 2021-08-31 同济大学 Electricity stealing detection method, system and medium based on time sequence automatic encoder
CN114692302A (en) * 2022-03-28 2022-07-01 中南大学 Fatigue crack detection method and system based on Gaussian mixture model

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination