CN116684877A

CN116684877A - GYAC-LSTM-based 5G network traffic anomaly detection method and system

Info

Publication number: CN116684877A
Application number: CN202310539066.7A
Authority: CN
Inventors: 孙茜; 田霖; 路淼顺
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2023-05-12
Filing date: 2023-05-12
Publication date: 2023-09-01

Abstract

The invention provides a GYAC-LSTM-based 5G network traffic anomaly detection method and system. The method comprises: acquiring 5G network traffic data labeled with anomaly information, and applying conditional filtering to the 5G network traffic data to obtain original training data; performing network packet analysis on the original training data to obtain feature values and generate a candidate feature set for each type of traffic, and performing GYAC-based feature selection on each candidate feature set to remove redundant features and obtain target training data; inputting the target training data into an LSTM model to perform 5G network traffic anomaly detection, and constructing a loss function from the detection result and the anomaly labels of the target training data to train the LSTM model, yielding a traffic anomaly detection model; and inputting the 5G network traffic to be detected into the traffic anomaly detection model to obtain its anomaly information as the detection result.

Description

GYAC-LSTM-based 5G network traffic anomaly detection method and system

Technical Field

The invention relates to the technical field of attack detection in mobile communication, in particular to a method for constructing a GYAC-LSTM anomaly detection model oriented to 5G network traffic.

Background

The 5G network user plane uses techniques such as sinking the user plane function (UPF) network element toward the network edge and edge computing, so that it can carry low-latency, high-bandwidth traffic and meet the customization requirements of different users and vertical industries. As a result, 5G network traffic is growing on a large scale: Ericsson stated in a published mobility report that global mobile network data traffic reached about 108 EB between the second and third quarters of 2022. As traffic grows, botnets formed from hundreds of millions of insecure devices and large numbers of interconnected nodes will launch stronger, more complex and more damaging DDoS traffic attacks. For vertical industries, user plane traffic carries a large amount of instruction data for control, operation and maintenance; once exploited by an attacker, it creates security risks such as disordered production control and service outages. Research on 5G network traffic anomaly detection is therefore urgently needed.

To detect abnormal traffic in a 5G network, many features are usually collected to avoid information loss, so a data set may contain a large number of irrelevant or redundant features, which makes the traffic sample data large and the feature dimension high. Feature selection algorithms are generally classified into filter, wrapper and embedded methods. Filter methods select important features using measures such as information gain, the chi-square test and Relief. Wrapper methods typically use techniques such as recursive elimination to select feature subsets according to model scores; they can usually find the optimal feature subset but are more complex. Embedded methods fuse feature selection with model training, for example through tree-based models or penalty terms, so that features are selected automatically during training; they are also highly complex, and how to combine the feature selection method with the model remains a problem to be considered.

Conventional anomaly detection methods use all of the collected features for model training, which lengthens data processing time and complicates the model. Feature selection for massive, high-dimensional 5G network traffic is therefore an important measure for abnormal traffic detection and a focus of the present invention.

In addition, 5G traffic anomaly detection is a problem of analyzing long time series. Conventional RNN-based traffic anomaly detection performs well on time series that are not long, but 5G traffic time series reach lengths of hundreds of thousands or even millions of steps. With sequences this long, conventional RNNs suffer from vanishing and exploding gradients. Overcoming this defect of conventional RNNs is another focus of the present invention.

Disclosure of Invention

The invention aims to solve the prior-art problem of how to efficiently extract features from massive, high-dimensional 5G network traffic, and provides a filter-type feature selection method based on the Gini index and cosine similarity. The method removes redundancy from the feature set while screening important features, and is suitable for processing large-scale traffic data. The invention further provides a GYAC-LSTM-based 5G network traffic anomaly detection method: the original data are first preprocessed by conditional filtering; feature values are then obtained by statistical methods and a candidate feature set is generated; next, the GYAC (Gini Index, Year-on-Year Decrease Index And Cosine Similarity) feature selection algorithm screens important features and removes redundant ones, reducing the feature dimension; finally, the newly generated target feature subset is used as the input for LSTM model training.

Aiming at the defects of the prior art, the invention provides a 5G network traffic anomaly detection method based on GYAC-LSTM, which comprises the following steps:

a data preprocessing step of acquiring 5G network traffic data labeled with anomaly information, and applying conditional filtering to the 5G network traffic data to obtain original training data;

a feature selection step of performing network packet analysis on the original training data to obtain feature values and generate a candidate feature set for each type of traffic, and performing GYAC-based feature selection on each candidate feature set to remove redundant features and obtain target training data;

a training step of inputting the target training data into an LSTM model to perform 5G network traffic anomaly detection, and constructing a loss function from the detection result and the anomaly labels of the target training data to train the LSTM model, yielding a traffic anomaly detection model;

and an anomaly detection step of inputting the 5G network traffic to be detected into the traffic anomaly detection model to obtain the anomaly information of that traffic as the detection result.

In the GYAC-LSTM-based 5G network traffic anomaly detection method, the data preprocessing step comprises:

first cleaning the data with a filtering method to remove irrelevant data; then sorting the cleaned traffic data in time order, setting the maximum length to $Z$ and the time interval to $t$, and placing all data flows within each $t$ seconds into a data flow subset $flow_t$; computing the $K$-dimensional feature set $F_{K,t}$ of $flow_t$ by statistical methods, where $F_{K,t} = \{f_{1,t}, f_{2,t}, \ldots, f_{K,t}\}$; and arranging the feature vectors of all time intervals to finally obtain the data set $D_{K,T}$ of one type of traffic, where $D_{K,T} = \{F_{K,1}, F_{K,2}, \ldots, F_{K,T}\}$;
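The windowing described in this step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: packets are assumed to be hypothetical `(timestamp, size)` pairs, the three statistics (packet count, byte sum, mean size) merely stand in for the $K$ features, windows with no packets are skipped, and the maximum length $Z$ is not modeled.

```python
from collections import defaultdict
from statistics import mean


def window_features(packets, t):
    """Group (timestamp, size) packets into t-second windows and
    return one small feature vector per non-empty window."""
    windows = defaultdict(list)
    for ts, size in sorted(packets):          # sort in time order
        windows[int(ts // t)].append(size)    # index of the t-second window
    features = []
    for idx in sorted(windows):
        sizes = windows[idx]
        # Assumed K = 3 features: packet count, byte sum, mean packet size.
        features.append([len(sizes), sum(sizes), mean(sizes)])
    return features


# Hypothetical packets: (timestamp in seconds, size in bytes).
packets = [(0.1, 100), (0.5, 300), (1.2, 200), (2.7, 50), (2.9, 150)]
D = window_features(packets, t=1.0)
```

Each inner list of `D` plays the role of one $F_{K,t}$, and the outer list plays the role of $D_{K,T}$ for a single traffic type.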

In the GYAC-LSTM-based 5G network traffic anomaly detection method, the feature selection step comprises: using the Gini index difference and the year-on-year decrease index together to represent the contribution of a feature to distinguishing two types of traffic data, and calculating, summing and ranking these contributions over the different types of traffic data in turn to obtain the importance ranking of the traffic features; using cosine similarity to represent the correlation between different features within the same type of traffic data and converting it into a distance between features; then selecting, according to the redundancy threshold and the importance threshold, the features ranked highest by correlation coefficient and importance coefficient; and finally taking the union of the features selected for each type of traffic to obtain the target training data;

In the GYAC-LSTM-based 5G network traffic anomaly detection method, the feature selection step specifically comprises:

for a three-dimensional data set $D_{N \times K \times T}$, where $N$ is the number of data types, $K$ is the feature dimension and $T$ is the data length, each type of traffic data is expressed as a set of $K$ features. The uncertainty of a feature is measured by its Gini index, as shown in formula (7),

which uses the probabilities of the different values taken by the feature; the Gini index of a feature lies between 0 and 1. Let $n_1, n_2 \in N$. The Gini difference is used to represent how much the feature contributes to distinguishing the traffic data of types $n_1$ and $n_2$; this is contribution coefficient one, as shown in formula (8).

The average value of the feature over the total length $T$ is given by formula (9).

The year-on-year decrease index is used to represent how much the feature contributes to distinguishing the data of types $n_1$ and $n_2$; this is contribution coefficient two, as shown in formula (10),

whose value lies between 0 and 1. Contribution coefficients one and two are given different weights and summed to give the total contribution of the feature to distinguishing the two types of traffic data, as shown in formula (11).

The contributions between one type of traffic data and each of the other types are calculated and summed to obtain the importance coefficient of the feature, as shown in formula (12).

The correlation between features is calculated using cosine similarity. Let $k_1, k_2 \in K$; the correlation coefficient of the two features is their cosine similarity, and the distance coefficient between the two is shown in formula (14).

The distances from a feature to all other features are calculated in turn. Important features are then selected according to the importance coefficient and the distance coefficients of the feature: according to a preset redundancy threshold $m_1$, the proportion of features farthest from the feature is retained; according to a preset importance threshold $m_2$, after the redundant features have been removed according to $m_1$, the proportion of features ranked highest by importance coefficient is retained, giving the feature subset of that type of traffic data. The union of the feature subsets of all types of traffic data is taken to obtain the target training data of $D_{N \times K \times T}$.
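The selection procedure can be sketched in Python as follows. This is a hedged illustration rather than the patented algorithm. Only the Gini index $1 - \sum_i p_i^2$ is a standard definition. The year-on-year decrease index (taken here as the relative gap between two feature means, assuming non-negative features), the distance (taken as one minus the cosine similarity), the equal weights `w1` and `w2`, and the choice of the most important feature as the reference for the redundancy step are all assumptions made for the example.

```python
import math


def gini(values):
    """Gini index 1 - sum(p_i^2) over the distinct values of one feature."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def yoy_decrease(a, b):
    """ASSUMED form: relative gap between the means of two sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    top = max(abs(mean_a), abs(mean_b))
    return 0.0 if top == 0 else abs(mean_a - mean_b) / top


def cosine_distance(a, b):
    """ASSUMED form: 1 - cosine similarity of two feature sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero feature as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def importance(data, n, k, w1=0.5, w2=0.5):
    """Sum, over all other traffic types, of the weighted contributions of feature k."""
    total = 0.0
    for other in data:
        if other == n:
            continue
        c1 = abs(gini(data[n][k]) - gini(data[other][k]))   # coefficient one
        c2 = yoy_decrease(data[n][k], data[other][k])        # coefficient two
        total += w1 * c1 + w2 * c2
    return total


def select_features(data, m1, m2):
    """data maps a traffic type to a list of K feature sequences of length T."""
    selected = set()
    for n, feats in data.items():
        k_count = len(feats)
        imp = [importance(data, n, k) for k in range(k_count)]
        ref = max(range(k_count), key=lambda k: imp[k])      # assumed reference feature
        others = [k for k in range(k_count) if k != ref]
        others.sort(key=lambda k: cosine_distance(feats[ref], feats[k]), reverse=True)
        keep = [ref] + others[:math.ceil(m1 * len(others))]  # redundancy threshold m1
        keep.sort(key=lambda k: imp[k], reverse=True)
        selected.update(keep[:math.ceil(m2 * len(keep))])     # importance threshold m2
    return sorted(selected)                                   # union over traffic types


# Hypothetical data: two traffic types, K = 3 features, T = 4 time steps.
data = {
    "normal": [[1, 1, 1, 1], [2, 2, 2, 2], [1, 2, 1, 2]],
    "attack": [[1, 1, 1, 1], [8, 8, 8, 8], [1, 2, 1, 2]],
}
chosen = select_features(data, m1=0.5, m2=0.5)
```

In this toy data only feature 1 differs between the two traffic types, so it is the one that survives both thresholds; raising `m1` and `m2` to 1.0 keeps every feature.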

In the GYAC-LSTM-based 5G network traffic anomaly detection method, the training step comprises:

the input layer of the LSTM model normalizes the features of the target training data, as shown in formulas (15) and (16),

where the quantities involved are the raw input feature value; the minimum and maximum values in the feature set; $V_{max}$ and $V_{min}$, the maximum and minimum of the mapping interval; and the scaled normalization result. Let the total number of time steps of the data be $T$, with each step containing $K$-dimensional features; the input is denoted $Input = \{X_{1,K}, X_{2,K}, \ldots, X_{T,K}\}$. Each training pass comprises $T$ time steps, each time step corresponds to one LSTM unit, and each unit processes $K$-dimensional feature data; finally the prediction result is fed into a fully connected layer. The output dimension of the fully connected layer equals the number of traffic data types; assuming there are $N$ types, the output is $Output = \{Y_1, Y_2, \ldots, Y_N\}$. To make the output of the fully connected layer correspond to the probability of predicting each category, a softmax function maps the outputs to values between 0 and 1 whose sum is 1, and these values are taken as the classification probabilities for the current input. Let $y = [y_1, y_2, \ldots, y_N]$; the output probability of $y_i$ after the softmax function is shown in formula (17),

where $S_i$ is the probability predicted by the model for type $i$; the category with the maximum probability is taken as the detection result.
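The normalization and softmax steps can be illustrated with a short Python sketch. It assumes the common min-max form, in which a value is first scaled by the feature's own minimum and maximum and then mapped into the interval from `v_min` to `v_max`, and the usual softmax $S_i = e^{y_i} / \sum_j e^{y_j}$. The function names and sample numbers are hypothetical.

```python
import math


def min_max_scale(values, v_min=0.0, v_max=1.0):
    """Map values linearly so that min(values) -> v_min and max(values) -> v_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [v_min for _ in values]   # constant feature: avoid division by zero
    return [(x - lo) / (hi - lo) * (v_max - v_min) + v_min for x in values]


def softmax(logits):
    """Convert fully connected layer outputs into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


scaled = min_max_scale([10, 20, 30])
probs = softmax([2.0, 1.0, 0.1])          # hypothetical outputs for N = 3 types
predicted = max(range(len(probs)), key=lambda i: probs[i])
```

Here `predicted` is the index of the category with the largest probability, which corresponds to taking the maximum-probability category as the detection result.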

The invention also provides a 5G network traffic anomaly detection system based on GYAC-LSTM, which comprises:

a data preprocessing module, configured to acquire 5G network traffic data labeled with anomaly information and apply conditional filtering to the 5G network traffic data to obtain original training data;

a feature selection module, configured to perform network packet analysis on the original training data to obtain feature values and generate a candidate feature set for each type of traffic, and to perform GYAC-based feature selection on each candidate feature set to remove redundant features and obtain target training data;

a training module, configured to input the target training data into an LSTM model to perform 5G network traffic anomaly detection, and to construct a loss function from the detection result and the anomaly labels of the target training data to train the LSTM model, yielding a traffic anomaly detection model;

and an anomaly detection module, configured to input the 5G network traffic to be detected into the traffic anomaly detection model to obtain the anomaly information of that traffic as the detection result.

In the GYAC-LSTM-based 5G network traffic anomaly detection system, the data preprocessing module performs the data preprocessing described above for the method.

In the GYAC-LSTM-based 5G network traffic anomaly detection system, the feature selection module is configured to: use the Gini index difference and the year-on-year decrease index together to represent the contribution of a feature to distinguishing two types of traffic data, and calculate, sum and rank these contributions over the different types of traffic data in turn to obtain the importance ranking of the traffic features; use cosine similarity to represent the correlation between different features within the same type of traffic data and convert it into a distance between features; then select, according to the redundancy threshold and the importance threshold, the features ranked highest by correlation coefficient and importance coefficient; and finally take the union of the features selected for each type of traffic to obtain the target training data;

In the GYAC-LSTM-based 5G network traffic anomaly detection system, the feature selection module computes the Gini index of a feature from the probabilities of the different values taken by the feature, the Gini index lying between 0 and 1; letting $n_1, n_2 \in N$, it uses the Gini difference to represent how much the feature contributes to distinguishing the traffic data of types $n_1$ and $n_2$, which is contribution coefficient one, as shown in formula (8).

In the GYAC-LSTM-based 5G network traffic anomaly detection system, the training module performs the training described above for the method.

The advantages of the invention are as follows:

The invention provides a GYAC-LSTM-based 5G network traffic anomaly detection method that introduces the Gini index, the year-on-year decrease index and cosine similarity for filter-type feature selection. The Gini index focuses on how well a feature discriminates between samples, so differences between features can be mined more effectively. The year-on-year decrease index better reflects changes in the scale of a feature. The cosine distance performs better on features with large scale differences and on sparse sample data points, and is suitable for high-dimensional data. Taking the temporal characteristics of the features into account, the method exploits the strength of LSTM in analyzing time series data and trains the LSTM network on the dimension-reduced feature subset, improving the accuracy of 5G abnormal traffic detection. In the experimental analysis, a 5G network experimental platform was used to generate several 5G abnormal traffic data sets for analyzing and verifying the proposed detection method. The experimental results show that the proposed method greatly reduces the feature dimension while keeping the accuracy of abnormal traffic identification at 93% to 98%. Compared with similar algorithms, it saves 8% to 44% of the training time.

In addition, to address the reduced anomaly detection efficiency caused by exploding and vanishing gradients when conventional RNN-based anomaly detection is trained on 5G network traffic with long time series characteristics, a 5G network traffic anomaly detection method based on the Gini index and cosine similarity (Gini Index And Cosine Similarity, GIACS) and an LSTM network is provided. The method greatly reduces the feature dimension while keeping the accuracy of abnormal traffic identification at 93% to 99% when the feature dimension is only 25%, 50% or 75% of the full feature set.

Drawings

FIG. 1 is a flow chart of a filtering method;

FIG. 2 is a flow chart of a wrapper method;

FIG. 3 is a flow chart of an embedded method;

FIG. 4 is a block diagram of an LSTM cell;

FIG. 5 is a schematic diagram of a GYAC-LSTM based 5G traffic anomaly detection architecture;

FIG. 6 is a flow chart of data preprocessing;

FIG. 7 is a flow chart of a GYAC feature selection algorithm;

FIG. 8 is an LSTM model diagram;

FIG. 9 is a schematic diagram of a 5G network traffic attack;

FIG. 10 is a schematic diagram of launching a 5G DoS Hulk attack;

FIG. 11 is a schematic diagram of a normal UE registration failure;

FIG. 12 is a schematic diagram of a normal UE registration success after the attack is stopped;

FIG. 13 is a schematic diagram of launching a 5G DoS Slowhttp attack;

FIG. 14 is a schematic diagram of service unavailability;

FIG. 15 is a schematic diagram of a system error;

FIG. 16 is a schematic diagram of launching a 5G DDoS Bonesi attack;

FIG. 17 is a schematic diagram of traffic data when no attack is initiated;

FIG. 18 is a schematic diagram of traffic data after attack initiation;

FIG. 19 is a ranking chart of important features for the 5G DoS Hulk attack;

FIG. 20 is a ranking chart of important features for the 5G DoS Slowhttp attack;

FIG. 21 is a ranking chart of important features for the 5G DDoS Bonesi attack;

FIG. 22 is a ranking chart of important features for normal traffic;

FIG. 23 is a schematic illustration of feature dimension reduction rates at different thresholds;

FIG. 24 is a graph of detection performance and training time for different feature dimensions;

FIG. 25 is a graph of performance versus cost for different feature dimensions;

FIG. 26 is a graph of detection performance and training time for different feature dimensions.

Detailed Description

While studying 5G network traffic anomaly detection, the inventors found that the defects of the prior art arise because the high-dimensional, long-time-series nature of 5G network traffic is not taken into account. Through research on the Gini index, the year-on-year decrease index, cosine similarity, filter-type feature selection and the LSTM network (a special RNN), the inventors found that these defects can be addressed by a filter-type feature selection method that introduces the Gini index, the year-on-year decrease index and cosine similarity. The Gini index focuses on how well a feature discriminates between samples, so differences between features can be mined more effectively. The year-on-year decrease index better reflects changes in the scale of a feature. The cosine distance performs better on features with large scale differences and on sparse sample data points, and is suitable for high-dimensional data. Taking the temporal characteristics of the features into account, the method exploits the strength of LSTM in analyzing time series data and trains the LSTM network on the dimension-reduced feature subset, improving the accuracy of 5G abnormal traffic detection.

In order to achieve the technical effects, the invention comprises the following technical key points:

Key point 1: to address the long processing time and poor detection performance caused by the massive, high-dimensional, long-time-series nature of 5G network traffic data, the invention provides a GIACS-LSTM-based 5G traffic anomaly detection method and framework. The framework has two main functional components: data preprocessing and feature selection, and 5G traffic anomaly detection.

Key point 2: for the data preprocessing and feature selection component of key point 1, a feature selection method for massive, high-dimensional 5G network traffic is provided, which screens the important features of 5G network traffic, removes irrelevant and redundant features, and reduces the dimension of the traffic samples.

Key point 3: for the data preprocessing part of key point 2, the original 5G traffic data are preprocessed: the data are cleaned using filtering conditions, irrelevant data are removed, and character features are converted into numerical features.

Key point 4: for the feature selection part of key point 2, a feature selection algorithm based on the Gini index and cosine similarity (GIACS) is applied to the 5G network traffic data set obtained after data preprocessing. It screens the features that are important to the anomaly detection model and removes redundant features, which reduces the dimension of high-dimensional traffic data, shrinks the training data of the anomaly detection model, and speeds up model training.

Key point 5: in the GIACS-based feature selection algorithm of key point 4, the Gini index difference and the year-on-year decrease index together represent the contribution of a feature to classifying two types of 5G traffic, and cosine similarity represents the correlation between different features in 5G traffic data. This yields the importance ranking of the traffic features for each type of 5G traffic and the correlation between different features within the same type of traffic data; features ranked highest by correlation coefficient and importance coefficient are then selected according to the importance threshold and the redundancy threshold to obtain the target feature subset.

Key point 6: to represent the classification contribution for two types of 5G traffic with the Gini index difference and the year-on-year decrease index of key point 5, the Gini index of a feature in 5G traffic is first defined, and the Gini difference derived from it represents contribution coefficient one of the feature in distinguishing the two types of traffic data; the feature mean is then calculated, and the year-on-year decrease index derived from it represents contribution coefficient two of the feature across data types. Contribution coefficients one and two are summed with different weights to give the total contribution of the feature to distinguishing the two types of traffic, from which the feature importance coefficient is calculated. The larger this coefficient, the more the feature benefits the classification of all the data.

Key point 7: to represent the correlation between different features in 5G traffic data with cosine similarity, the correlation coefficient between features is first defined and then converted into a distance coefficient between the two features that is convenient to calculate and sort. This keeps the distance between features greater than zero, which simplifies calculating and sorting feature distances.

Key point 8: for the redundancy threshold and importance threshold of key point 5, the redundancy threshold is the proportion of features farthest from a given feature that are retained, and the importance threshold is the proportion of features ranked highest by importance coefficient that are retained after redundant features have been removed using the redundancy threshold. Important features are selected according to the importance and distance coefficients of the features to obtain the optimal feature set of the original data set. This not only reduces the feature dimension effectively but also generates feature subsets dynamically by adjusting the two thresholds, to meet the requirements of different scenarios.

Key point 9: for the 5G network traffic attack cases in the 5G network experimental environment used by the invention, the target is the UPF network element, which is usually sunk to the vicinity of unattended base stations or edge clouds to meet latency requirements and therefore has lower security. An attacker can exploit or hijack such a UPF network element; accordingly, three DoS attack schemes that pose security threats to the 5GC and other important network elements are implemented.

In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.

Feature selection methods:

Network traffic contains a large number of irrelevant and redundant features, which significantly degrade the performance of large-scale data-driven network anomaly detection. Screening the features that are important to the model and removing redundant ones during data preprocessing therefore has two advantages: unimportant and redundant features are removed from the feature set, which reduces the scale of the training data and speeds up model training; and reducing the feature dimension helps avoid model overfitting. Feature selection methods can generally be classified into filter, wrapper and embedded methods.

(1) Filter method

A filter method scores the attributes of features, ranks the features by those scores, and selects features accordingly; it is independent of the subsequent learning process. The attributes of a feature generally include divergence and correlation. Divergence refers to the purity, uncertainty and information content of a feature, and is mainly expressed with the Gini index, information entropy and similar measures. Correlation refers to the relatedness or similarity between features, and is mainly expressed with the cosine distance, the Euclidean distance and the Pearson correlation coefficient. A target feature subset that meets the requirements is generated from the ranking of attribute scores and a set threshold. The main filter-type feature selection methods at present include information gain, the variance threshold, the chi-square test and Relief. The flow of filter-type feature selection is shown in FIG. 1.
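As a minimal sketch of the score, rank and threshold pattern described above, the following Python example uses the variance threshold, one of the filter criteria listed. The feature names, values and threshold are hypothetical, and population variance is chosen only for illustration.

```python
from statistics import pvariance


def filter_select(features, threshold):
    """Score each feature by its population variance, rank the features,
    and keep those whose score reaches the threshold."""
    scores = {name: pvariance(values) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= threshold]


# Hypothetical per-window traffic features.
features = {
    "pkt_count": [5, 5, 5, 5],          # constant, so it carries no information
    "byte_sum": [100, 300, 200, 400],
    "mean_size": [20, 22, 21, 23],
}
kept = filter_select(features, threshold=1.0)
```

No learner is involved at any point, which is what makes this kind of selection cheap on large data sets.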

(2) Wrapper method

A wrapper method uses learner performance as the evaluation criterion, and aims to generate the feature set most favorable to that performance. The main idea is to construct feature sets by recursively adding or deleting features and feed them to the learner until the feature subsets have been traversed and the target subset with the best learning performance is found; it can therefore be regarded as a greedy search for the optimal subset. The main methods at present include genetic algorithms, particle swarm optimization, evolutionary algorithms and differential evolution. The flow of wrapper-type feature selection is shown in FIG. 2.
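The greedy character of a wrapper method can be sketched with a simple forward selection loop in Python. The `score` callable stands in for training and evaluating a learner on a feature subset; here it is a hypothetical toy function, so the example shows only the search strategy, not a real evaluation.

```python
def forward_select(candidates, score):
    """Greedy forward selection: repeatedly add the single feature that
    improves the score the most, and stop when no feature improves it."""
    selected = []
    best = score(selected)
    while True:
        pick = None
        for feature in candidates:
            if feature in selected:
                continue
            trial = score(selected + [feature])
            if trial > best:
                best, pick = trial, feature
        if pick is None:
            return selected
        selected.append(pick)


# Hypothetical per-feature gains; a real wrapper would train a learner instead.
gains = {"a": 0.3, "b": 0.2, "c": -0.1}
chosen = forward_select(["a", "b", "c"], lambda subset: sum(gains[f] for f in subset))
```

Every call to `score` corresponds to a full training run in practice, which is why wrapper methods consume far more computing resources than filter methods.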

(3) Embedded method

An embedded method combines feature selection with learner training, so that the learner selects features automatically during training. It fits the data with certain special models, takes the models' own evaluation of the features as the evaluation index, and performs feature selection following the wrapper idea. Embedded feature selection mainly includes penalty-term-based and tree-model-based feature selection. The flow of embedded feature selection is shown in FIG. 3.

The wrapper method can improve learner performance, but the selected feature subset generalizes poorly: feature selection must be repeated for each learning algorithm, and the algorithm generally consumes more computing and memory resources. For the embedded method, how to integrate feature selection into different learners remains a problem to be solved, and the algorithm also involves heavy computation and a complex calculation process. The filter method is general, has low algorithmic complexity, is suitable for large-scale data sets, and is widely used in practice.

LSTM network:

in the aspect of anomaly detection, the deep learning technology is widely applied because the threat and anomaly of a computer network can be detected in various applications, wherein an artificial neural network RNN with a sequential structure can solve the problem of time series in theory, the core of the network is to use a special memory unit to store the state of the last moment, so that the output of a hidden layer of each moment of the RNN can be transmitted to the next moment, therefore, the network at each moment can keep certain historical information from the previous moment and can be transmitted to the next moment together with the network state at the current moment, and 5G network flow is a long-time series with the sequential structure.

However, RNNs suffer from vanishing and exploding gradients when handling long time series. This is because the derivatives of the activation function are multiplied cumulatively during backpropagation; when the weights are smaller than 1 or very large, vanishing or exploding gradients, respectively, are almost inevitable. To address this drawback and meet the need for anomaly detection on long 5G traffic time series, the LSTM network is introduced.
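The cumulative multiplication described above can be illustrated with a minimal sketch. It assumes, purely for illustration, a single scalar weight and a constant activation derivative of 0.25 (the maximum slope of the Sigmoid function); the function name `gradient_scale` is hypothetical and not part of the invention.

```python
def gradient_scale(weight: float, steps: int, activation_deriv: float = 0.25) -> float:
    """Magnitude of a gradient after it has been multiplied `steps` times
    by the per-step factor weight * activation_deriv."""
    return abs(weight * activation_deriv) ** steps


# Per-step factor 0.25: the gradient shrinks towards zero (vanishing).
vanishing = gradient_scale(weight=1.0, steps=50)

# Per-step factor 2.0: the gradient grows without bound (exploding).
exploding = gradient_scale(weight=8.0, steps=50)

# Per-step factor exactly 1.0: the gradient magnitude is preserved.
stable = gradient_scale(weight=4.0, steps=50)

print(vanishing, exploding, stable)
```

Only a per-step factor close to 1 keeps the gradient usable over 50 steps, which is the motivation for the gated cell state of the LSTM.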

LSTM is a special RNN designed mainly to solve the vanishing and exploding gradient problems in long-sequence training, and it performs better in long-sequence analysis. LSTM takes the long-term dependencies between distant parameters into account and replaces the neurons of the hidden layer with cell states. Each cell state is protected and controlled by three gates: the forget gate f_t, which controls the information to be discarded, according to equation (1); the input gate i_t, which determines the information to be updated, according to equation (2); and the output gate o_t, which determines the information to be output from the cell state, according to equation (3). C_t denotes the cell state, or long-term memory, according to equation (4); h_t denotes the short-term memory, according to equation (5); and the candidate state is given by equation (6). x_t denotes the current input, tanh the hyperbolic tangent activation function, and σ the Sigmoid activation function. The basic LSTM structure is shown in fig. 4.

However, the LSTM network has a complex structure and is time-consuming to train: each LSTM cell contains four fully connected layers (MLPs), so if the LSTM spans many time steps or the network is deep, the computational load becomes large and training very slow. The GYAC-LSTM-based 5G network traffic feature selection algorithm of the invention mitigates this drawback by reducing the dimension of the input traffic features while maintaining a high anomaly detection rate. Experiments show that when the traffic feature dimension is reduced by 75%, the anomaly detection accuracy and F1 value can still be kept at about 90%, and the training time is reduced by about 15%. The computational load and the training time of the LSTM network are thus both reduced, which alleviates the heavy computation and long running time of the LSTM network to a certain extent.

The LSTM operates in three stages: a forgetting stage, a selective memory stage, and an output stage. The forgetting stage selectively forgets information in the cell state C_{t-1} passed in from the previous node. It is implemented by the forget gate f_t, whose values lie between 0 and 1, where 0 means completely discarded and 1 means completely retained. The selective memory stage selectively memorizes the input x_t of the current node. It is implemented by the input gate i_t, whose values also lie between 0 and 1. The results obtained through the forget gate f_t and the input gate i_t are added to give the long-term memory C_t passed to the next stage. The output stage controls which information is output as the current state: the current cell state is first passed through tanh to obtain a value between -1 and 1, which is then multiplied by the output gate o_t to output the required information.

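$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{1}$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{2}$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{3}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$

$$h_t = o_t \odot \tanh(C_t) \tag{5}$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{6}$$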

Through its internal structure, the LSTM uses gating units to control the state that is passed on, so that information requiring long-term memory is retained and unimportant information is discarded, which solves the problem of analyzing long time series.
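The gate computations referred to as equations (1) to (6) can be sketched as a single LSTM time step. The sketch assumes the standard LSTM formulation, in which the candidate state is a tanh layer over [h_{t-1}, x_t] and the new cell state is the forget-gated old state plus the input-gated candidate. The function names, the `params` layout, and the all-zero weights in the usage lines are illustrative assumptions, not the parameters of the invention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. `params` maps 'f', 'i', 'o', 'c' to (W, b) pairs,
    each W acting on the concatenation [h_prev, x_t]."""
    v = h_prev + x_t                                    # [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(*params["f"][:1], v, params["f"][1])]
    i_t = [sigmoid(z) for z in affine(*params["i"][:1], v, params["i"][1])]
    o_t = [sigmoid(z) for z in affine(*params["o"][:1], v, params["o"][1])]
    c_hat = [math.tanh(z) for z in affine(*params["c"][:1], v, params["c"][1])]
    # Long-term memory: forget part of the old state, add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # Short-term memory: output gate applied to the squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Usage with hypothetical all-zero weights: every gate equals 0.5 and the
# candidate state is 0, so the cell state is simply halved.
hidden, inputs = 2, 1
zero_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zero_b = [0.0] * hidden
params = {gate: (zero_W, zero_b) for gate in "fioc"}
h_t, c_t = lstm_step([0.3], [0.0, 0.0], [1.0, -2.0], params)
print(h_t, c_t)
```

With trained weights the gates take input-dependent values between 0 and 1, which is how the cell decides what to retain and what to discard.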

The abnormality detection method comprises the following steps:

model architecture:

With the popularization of 5G networks, network traffic is growing explosively, and the massive, high-dimensional, long-time-series nature of 5G network traffic data makes anomaly detection time-consuming and degrades detection performance. To address this problem, the invention provides a 5G network traffic anomaly detection method based on the GYAC-LSTM algorithm. The overall architecture is shown in fig. 5 and has two main functional components: a data preprocessing and feature selection function, and a 5G traffic anomaly detection function.

The data preprocessing and feature selection function removes unimportant and redundant features from the feature set through feature selection, which addresses the long processing time caused by the difficulty of handling massive, high-dimensional 5G data efficiently. It is divided into data preprocessing, statistics, and GYAC-based feature selection. First, a conditional filtering method is applied: for example, the filter of the network packet analysis tool Wireshark is used with the filtering conditions set to the corresponding IP and protocol, so that the original traffic is preprocessed and irrelevant data removed. The data is divided into an original training set and an original test set, and each data flow in the original traffic is labeled as either normal data or attack data containing an attack. Next, the Wireshark statistics function is used to obtain the feature values in the data and to generate the candidate feature sets for the target training feature subset and the target test feature subset, respectively. Finally, the GYAC-based feature selection algorithm screens out important features and removes redundant ones, achieving feature dimension reduction and yielding the target training and test feature subsets.

The 5G traffic anomaly detection function takes the long-term dependencies between distant 5G traffic parameters into account and replaces the neurons of the hidden layer with cell states, which addresses the poor anomaly detection performance caused by vanishing and exploding gradients when training on long 5G time series. The function comprises the LSTM model and performance evaluation. The newly generated target training feature subset is normalized and then input into the LSTM model for training. The target test subset is then input into the trained LSTM model for anomaly detection, and finally the performance is evaluated based on the detection and classification results.

Data preprocessing:

As shown in fig. 6, after the original traffic data is collected, it is first cleaned with a filtering method to remove irrelevant data. The data flows d of the original data are then sorted in time order, with the maximum length set to Z and the time interval set to t, and all data flows within each t seconds are grouped into a data flow subset flow_t. The K-dimensional feature set F_{K,t} of flow_t is computed with statistical methods, where F_{K,t} = {f_{1,t}, f_{2,t}, ..., f_{K,t}}. Arranging the feature vectors of all time intervals t finally gives the data set D_{K,T} of one class of traffic, where D_{K,T} = {F_{K,1}, F_{K,2}, ..., F_{K,T}}.
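The windowing step can be sketched as follows. The record layout (timestamp in seconds, packet length in bytes) and the three statistics computed per window (packet count, total bytes, mean length) are hypothetical stand-ins for the K features, chosen only to keep the example small; the maximum length Z is omitted, and windows without traffic are skipped.

```python
from collections import defaultdict


def window_features(flows, interval):
    """Group (timestamp_seconds, packet_length) records into consecutive
    windows of `interval` seconds and return one feature vector per
    non-empty window, in time order."""
    buckets = defaultdict(list)
    for ts, length in sorted(flows):            # sort the records by time
        buckets[int(ts // interval)].append(length)

    dataset = []
    for idx in sorted(buckets):
        lengths = buckets[idx]
        # Hypothetical 3-dimensional feature vector for this window.
        dataset.append([len(lengths), sum(lengths), sum(lengths) / len(lengths)])
    return dataset


# Usage: four packets, windows of 1 second.
flows = [(0.5, 100), (1.2, 300), (2.1, 60), (0.9, 200)]
dataset = window_features(flows, interval=1)
print(dataset)
```

Each inner list plays the role of one F_{K,t}, and the outer list plays the role of D_{K,T} for a single traffic type.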

Feature selection:

The invention provides a GYAC-based feature selection algorithm, shown in fig. 7. First, by studying the statistical properties of the features, the algorithm jointly considers the difference in uncertainty and the difference in relative magnitude between features: the Gini index difference and the relative-change index together express a feature's contribution to classifying two types of traffic data. The contributions of the feature with respect to each of the other data types are computed in turn and summed, producing an importance ranking of the features for each traffic type. Second, to reduce redundancy among the important features, cosine similarity is used to express the correlation between different features within the same class of traffic data, and it is converted into a distance between features. Features are then selected by their correlation coefficients and by the top ranking of their importance coefficients, according to a redundancy threshold and an importance threshold. Finally, the feature subsets selected for each traffic type are merged by taking their union. The specific algorithm is as follows:

Consider a three-dimensional data set D_{N×K×T}, where N is the number of data traffic types. The traffic types are: normal traffic data and traffic data containing 5G DoS Hulk Attack, 5G DoS Slowhttp Attack, and 5G DDoS Bonesi Attack. K is the feature dimension and T is the length of the data. Let D^n_{K,T} denote the traffic data of one traffic type, and let its feature set be denoted F^n_{K,T}, where the superscript n denotes the traffic type; the dimension index of the traffic data ranges from 1 to K and the length index from 1 to T. First, the Gini index G is defined to measure the degree of uncertainty of a feature, as shown in formula (7).

In the formula, the probability term is the probability of each distinct value among the values of the feature f^n_{k,T}. The Gini index of a feature lies between 0 and 1: the closer to 0, the lower the uncertainty, and the closer to 1, the higher the uncertainty. To determine whether the feature f^n_{k,T} can effectively distinguish two types of traffic data, let n_1, n_2 ∈ N; the Gini difference is used to express the feature's contribution coefficient one to classifying the data of types n_1 and n_2, as shown in formula (8).
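A minimal sketch of the uncertainty measure follows. It assumes that the Gini index takes its standard form, one minus the sum of squared value probabilities, and that contribution coefficient one is the absolute difference between the Gini indexes of the same feature in the two traffic types; both forms, and the function names, are assumptions consistent with the description rather than the exact formulas (7) and (8).

```python
from collections import Counter


def gini_index(values):
    """Gini index of one feature: 1 - sum(p_v ** 2), where p_v is the
    empirical probability of each distinct value (assumed standard form)."""
    total = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def gini_contribution(values_a, values_b):
    """Assumed contribution coefficient one: absolute Gini difference of the
    same feature observed in two traffic types."""
    return abs(gini_index(values_a) - gini_index(values_b))


# Usage: a constant feature has no uncertainty; a feature that alternates
# between two values has Gini index 0.5.
constant = [1, 1, 1, 1]
alternating = [1, 2, 1, 2]
print(gini_index(constant), gini_index(alternating))
print(gini_contribution(constant, alternating))
```

For continuous-valued features, the values would need to be binned before the probabilities are counted; the binning scheme is not specified in the description.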

Let the mean of the values of the feature f^n_{k,T} over the total length T be defined as shown in formula (9).

To capture the relative magnitude of the feature in different data types and thereby distinguish two types of traffic data, the relative-change index is used to express the feature's contribution coefficient two to classifying the data of types n_1 and n_2, as shown in formula (10).

The value of this index lies between 0 and 1: the closer to 0, the smaller the difference between the feature values, and the closer to 1, the larger the difference, and hence the better different types of traffic data can be distinguished. Contribution coefficient one and contribution coefficient two are combined in a weighted sum as the feature's total contribution to distinguishing the two types of traffic data, as shown in formula (11).

The feature's contributions to distinguishing its own traffic type from each of the other traffic types are computed and summed to obtain the importance coefficient of the feature. The larger the coefficient, the higher the feature's contribution to classifying all the data, as shown in formula (12).

The feature importance coefficient algorithm is shown in table 1.

TABLE 1 feature importance coefficient algorithm
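The importance coefficient can be sketched as follows, under loudly stated assumptions: the Gini index has its standard form; contribution coefficient two is taken to be |mean_a - mean_b| / max(|mean_a|, |mean_b|), which stays between 0 and 1 for non-negative traffic statistics; and the two coefficients are combined with equal weights of 0.5. None of these specifics is given in the text, and all function names are hypothetical.

```python
from collections import Counter


def gini_index(values):
    total = len(values)
    return 1.0 - sum((c / total) ** 2 for c in Counter(values).values())


def relative_change(values_a, values_b):
    """Assumed contribution coefficient two: relative difference of the means."""
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    denom = max(abs(mean_a), abs(mean_b))
    return 0.0 if denom == 0 else abs(mean_a - mean_b) / denom


def importance(feature_by_type, target, w1=0.5, w2=0.5):
    """Sum, over every other traffic type, of the weighted contribution of one
    feature to separating `target` from that type (weights are assumptions)."""
    own = feature_by_type[target]
    score = 0.0
    for traffic_type, values in feature_by_type.items():
        if traffic_type == target:
            continue
        coeff_one = abs(gini_index(own) - gini_index(values))
        coeff_two = relative_change(own, values)
        score += w1 * coeff_one + w2 * coeff_two
    return score


# Usage: one feature observed in three hypothetical traffic types.
feature_by_type = {
    "normal": [1, 1, 1, 1],
    "hulk": [1, 2, 1, 2],
    "bonesi": [4, 4, 4, 4],
}
score = importance(feature_by_type, "normal")
print(score)
```

Repeating the calculation for every feature and sorting by score gives the per-type importance ranking described above.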

To remove redundant features from the feature set, the correlation between different features is computed using cosine similarity, which is the cosine of the angle between two feature vectors and lies between -1 and 1: the closer to 0, the lower the correlation between the features, and the closer to 1, the higher the correlation. Let k_1, k_2 ∈ K; the correlation coefficient of features f^n_{k_1,T} and f^n_{k_2,T} is shown in formula (13).

For ease of calculation and sorting, the correlation coefficient is converted into a distance between the two features, as shown in formula (14).

The distance between features is therefore restricted to d > 0. The distances from f^n_{k,T} to all other features are calculated in turn; the feature distance coefficient algorithm is shown in table 2.

TABLE 2 feature distance coefficient algorithm
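The correlation and distance calculation can be sketched as follows. Cosine similarity is standard; the conversion to a distance, taken here as 1 - |cos θ| so that strongly correlated features are close together, is an assumption, since the exact form of formula (14) is not reproduced in the text. Function and feature names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature time series."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm


def feature_distance(u, v):
    """Assumed distance: 1 - |cos|, clamped at 0 against rounding error."""
    return max(0.0, 1.0 - abs(cosine_similarity(u, v)))


def distances_from(name, features):
    """Distances from one feature to every other feature of the same type."""
    return {
        other: feature_distance(features[name], series)
        for other, series in features.items()
        if other != name
    }


# Usage: 'bytes' is an exact multiple of 'packets' (redundant), while
# 'flags' is orthogonal to 'packets' (not redundant).
features = {
    "packets": [1.0, 2.0, 0.0],
    "bytes": [2.0, 4.0, 0.0],
    "flags": [0.0, 0.0, 3.0],
}
dist = distances_from("packets", features)
print(dist)
```

A small distance marks a candidate for removal as redundant, and a large distance marks a feature that carries different information.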

Based on the importance coefficient and the distance coefficients of a feature f^n_{k,T}, two thresholds are defined. The redundancy threshold m_1 is the proportion of features retained that are farthest from the feature f^n_{k,T}. The importance threshold m_2 is the proportion of features retained with the highest-ranked importance coefficients after the features redundant with f^n_{k,T} have been removed according to m_1. The feature subset of each traffic type is then obtained, and the union of these subsets is taken as the optimal feature set of D_{N×K×T}. This method not only reduces the feature dimension effectively but also generates feature subsets dynamically by controlling m_1 and m_2, to meet the needs of different scenarios.
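One possible reading of the two-threshold selection is sketched below. It assumes that, for each traffic type, the most important feature serves as the reference; that the m_1 fraction of the remaining features farthest from the reference is kept alongside it; and that the top m_2 fraction of the kept features by importance forms the subset for that type. The choice of reference feature, the rounding with `ceil`, and all names are assumptions; the text does not fix these details.

```python
import math


def select_for_type(importance, distance, m1, m2):
    """importance: feature -> importance coefficient for one traffic type.
    distance: feature -> {other feature -> distance} for the same type."""
    ref = max(importance, key=importance.get)            # assumed reference
    others = sorted(
        (k for k in importance if k != ref),
        key=lambda k: distance[ref][k],
        reverse=True,                                      # farthest first
    )
    kept = [ref] + others[: math.ceil(m1 * len(others))]  # redundancy step
    ranked = sorted(kept, key=lambda k: importance[k], reverse=True)
    return set(ranked[: math.ceil(m2 * len(ranked))])     # importance step


def select_features(per_type, m1, m2):
    """Union of the subsets selected for every traffic type."""
    selected = set()
    for importance, distance in per_type:
        selected |= select_for_type(importance, distance, m1, m2)
    return selected


# Usage with two hypothetical traffic types and four features.
type_1 = (
    {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.6},
    {"a": {"b": 0.05, "c": 0.7, "d": 0.4}},
)
type_2 = (
    {"a": 0.2, "b": 0.7, "c": 0.5, "d": 0.1},
    {"b": {"a": 0.9, "c": 0.1, "d": 0.6}},
)
subset = select_features([type_1, type_2], m1=0.5, m2=0.5)
print(sorted(subset))
```

Under this reading, setting m_1 and m_2 both to 1.0 keeps every feature, which matches the observation in the experimental analysis that the dimension reduction rate then falls to 0.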

LSTM model design

The invention takes the selected target feature subset as input and uses the LSTM model to detect and classify abnormal traffic data. The structure mainly comprises three parts: an input layer, LSTM units, and a fully connected layer, as shown in fig. 8.

To speed up network learning and avoid failure to converge caused by singular sample data, the input feature data is normalized at the input layer, as shown in formulas (15) and (16).

In the formulas, the input is the raw feature value; the minimum and maximum are taken over the feature set; V_max and V_min are the maximum and minimum of the mapping interval, typically set to 1 and 0; and the result is the scaled, normalized value. Let the total number of time steps of the data be T, with each step containing K-dimensional features, so that the input is Input = {X_{1,K}, X_{2,K}, ..., X_{T,K}}. Each training pass therefore requires T time steps; each time step corresponds to one LSTM unit, each unit processes K-dimensional feature data, and the prediction result is finally fed into the fully connected layer. The output dimension of the fully connected layer equals the number of traffic data types; assuming this is N, the output is Output = {Y_1, Y_2, ..., Y_N}. To map the output of the fully connected layer to the probability of predicting each category, the softmax function maps the outputs into the range 0 to 1 and ensures that they sum to 1, so that they serve as the classification probabilities for the current input. Let y = [y_1, y_2, ..., y_N]; the output probability of y_i after the softmax function is shown in formula (17).

In the formula, S_i is the probability with which the model predicts the normal data type or an abnormal type containing an attack.
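The input normalization and the output mapping can be sketched as follows. The sketch assumes the usual min-max form, (x - min) / (max - min) rescaled to the interval [V_min, V_max], and the standard softmax; subtracting the largest logit before exponentiating is a numerical-stability detail added here and does not change the result. The function names and the sample values are illustrative.

```python
import math


def min_max_scale(values, v_min=0.0, v_max=1.0):
    """Scale one feature to [v_min, v_max] (assumed min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant feature: avoid division by 0
        return [v_min for _ in values]
    return [(x - lo) / (hi - lo) * (v_max - v_min) + v_min for x in values]


def softmax(logits):
    """Map fully connected outputs to probabilities that sum to 1."""
    shift = max(logits)                 # stability: exponentiate values <= 0
    exps = [math.exp(y - shift) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: scale a raw feature, then turn four class scores (for example
# normal plus three attack types) into probabilities.
scaled = min_max_scale([2.0, 4.0, 6.0])
probs = softmax([2.0, 0.5, 0.1, -1.0])
predicted = probs.index(max(probs))
print(scaled, probs, predicted)
```

The index of the largest probability is taken as the predicted traffic type.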

Experimental analysis

The invention uses the constructed 5G network experimental platform to generate the 5G network abnormal traffic data set and to verify and analyze the anomaly detection technique. This section first introduces the 5G network experimental environment, then designs 5G network abnormal traffic attack cases and verifies them and collects a data set in the constructed environment, and finally verifies and analyzes the GYAC-LSTM-based 5G network traffic anomaly detection technique.

5G network experimental environment

The invention uses the constructed 5G network experimental platform to verify the experimental cases. The platform is built from simulation software and test tools; the main experimental tools are the simulated UE, the simulated RAN, the 5GC, the Postman signaling test software, the Wireshark packet capture software, and the iPerf network test software. The overall structure of the platform comprises an interaction module, a 5G network module, and a test module. The interaction module issues the 5G network configuration policy, including the registration and authentication keys of the simulated UE, the UE access interface, the simulated RAN interface, the UE subscription in the 5GC, and so on. The 5G network module is deployed in two virtual machines: the simulated UE and simulated RAN are deployed in virtual machine 1 using UERANSIM, and the 5GC is deployed in virtual machine 2 using Free5GC. Free5GC uses a service-based architecture and implements the functions of network elements including AMF, SMF, UDM, UDR, NSSF, PCF, AUSF, NRF, and UPF; each network element can set its interface information through configuration files, and the module operates on the same principle as a real 5G network environment. The test module is mainly used for network signaling tests, traffic collection, network state monitoring, and the like, and consists of a packet capture tool, a network test tool, and so on.

5G network traffic attack case

Combining 5G networks with the Internet of Things brings hundreds of millions of insecure devices and a large number of node connections, from which large-scale botnets can be built; a botnet formed from massive numbers of 5G IoT devices can launch stronger, more complex, and more harmful DDoS traffic attacks. In the 5G network architecture, the UPF network element is responsible for routing and forwarding user plane data and interacts with other network elements through interfaces such as N3, N4, and N6. To meet latency requirements it is usually sunk to an unattended base station site or near the edge cloud, which lowers the security of the UPF network element. Once exploited or hijacked by an attacker, it poses a serious security threat to the 5GC and other important network elements and exposes the network to the danger of large-scale paralysis, as shown in fig. 9. Based on the architecture of the 5G network and the protocols used, the invention designs three attack schemes: 5G DoS Hulk Attack and 5G DoS Slowhttp Attack, both of which exploit weaknesses of the HTTP protocol used inside the 5GC, and 5G DDoS Bonesi Attack, which targets the UPF network element. All use the experimental environment of the 5G network experimental platform introduced above.

(1)5G DoS Hulk Attack

This attack exploits the HTTP protocol used between 5GC network elements. The UE is controlled to start a large number of threads that issue high-frequency HTTP GET flood requests. Each request is independent, and various obfuscation techniques are used to bypass the caching measures on the server side so that every request is processed; the aim is to occupy a large amount of the 5GC's thread and computing resources. The specific attack scheme is as follows. The 5GC is started in virtual machine 2, and the routing configuration of the user plane UPF network element is modified so that user plane traffic can attack a 5GC network element (http-server). A simulated UE is then generated in virtual machine 1 to initiate the registration process and access the 5G network, and a virtual network card is generated once the PDU session is established. The Hulk traffic tool is configured with parameters such as the IP proxy, the attack target, and the obfuscation technique, and several attacks are launched, each lasting 80 s. The attack is characterized by a single attacking device that sends a large number of data packets at extremely short intervals, so that the instantaneous access volume of the 5GC increases by tens or even hundreds of times; this occupies a large amount of the 5GC's computing and memory resources and prevents it from serving normal UEs. It is a typical network attack.

Based on the above 5G DoS Hulk Attack case, the invention performs verification and data collection on the 5G network experimental platform. The simulated UE first accesses the 5GC, and 5G DoS Hulk Attack is then launched with the IP proxy set to the virtual network card generated by the simulated UE and the attack target set to the SMF network element in the 5GC. After the attack starts, the simulated UE sends a large number of independent HTTP requests, as shown in fig. 10.

The 5GC needs to process a large number of requests, resulting in failure to provide registration services for normal UEs, as shown in fig. 11.

After stopping 5G DoS Hulk Attack, the registration procedure of the UE can be handled normally, as shown in fig. 12.

(2)5G DoS Slowhttp Attack

This attack, also called an HTTP slow attack, exploits a legitimate HTTP mechanism used by the 5GC network elements: by altering the format of the HTTP request, it keeps connections to the 5GC open for a long time and continuously consumes a large amount of the 5GC's thread resources, memory resources, and so on. The specific scheme is as follows. The 5GC is started in virtual machine 2, and the routing configuration of the user plane UPF network element is modified so that user plane traffic can attack the 5GC network element. A simulated UE is generated in virtual machine 1 to initiate the registration process and access the 5G network, and a virtual network card is generated once the PDU session is established. The slowhttptest traffic tool is configured with parameters such as the proxy IP, the maximum number of connections, and the connection interval; the connection rate is set to 10/s, 50/s, and 100/s in turn, and each attack lasts 80 s. The attack is characterized by a single attacking device that sends a large number of TCP requests, each of which holds its connection open for a long time, so that the 5GC service becomes unavailable. It is a typical network attack.

Based on the above 5G DoS Slowhttp Attack case, the invention performs verification and data collection on the 5G network experimental platform. The simulated UE first accesses the 5GC, and 5G DoS Slowhttp Attack is then launched with the IP proxy set to the virtual network card generated by the simulated UE, the attack target set to the SMF network element in the 5GC, the connection interval set to 10 s, and the duration set to 80 s. The configuration is shown in fig. 13.

After the attack begins, a large number of independent HTTP requests are sent, and the service is reported as unavailable after 30 s, as shown in fig. 14.

The SMF network element in 5GC receives a large number of HTTP requests, which causes the number of files opened by the system and the number of communication links to exceed the maximum limit, as shown in fig. 15.

(3)5G DDoS Bonesi Attack

This attack is a DDoS attack that simulates a botnet. By controlling the botnet, a large number of TCP, UDP, and ICMP requests are sent to the UPF in a short time, and parameters such as the sending rate and payload size are varied so that the UPF cannot serve legitimate users. Specifically, the 5GC is started in virtual machine 2. Ten simulated UEs are generated in virtual machine 1 to initiate the registration process and access the 5G network as a botnet; ten virtual network cards are generated once the PDU sessions are established, and each launches a high-volume attack on the UPF. The Bonesi traffic tool is configured with parameters such as the proxy IP and the protocol type; the attack rate is set to 100/s, 500/s, and 1000/s in turn, and several attacks are launched, each lasting 80 s. The attack is characterized by multiple attacking devices, each of which sends a large number of data packets at extremely short intervals; the 5GC receives a large number of packets from different devices, which occupies a large amount of server-side resources. It is a typical network attack.

Based on the above 5G DDoS Bonesi Attack case, the invention performs verification and data collection on the 5G network experimental platform. Ten simulated UEs are first generated to initiate the registration process and access the 5GC, and 5G DDoS Bonesi Attack is then launched with the IP proxy of each UE set and the attack target set to the UPF network element in the 5GC. After the attack starts, each simulated UE sends a large number of TCP requests, as shown in fig. 16.

At this point, another simulated UE is generated to initiate the registration process, and packet injection software is used to send data packets to the UPF to simulate the traffic of a normal UE. Comparing the packet injection data before and after 5G DDoS Bonesi Attack, shown in fig. 17 and fig. 18, the packet loss rate increases from 0.45% to 79%.

5G network abnormal traffic collection:

When abnormal traffic is collected, all traffic is captured at the interfaces of the UPF network element. The reason is that the UPF implements the routing and forwarding of user plane data packets, so the user traffic of all UEs in its coverage area must flow through these interfaces; this avoids having to aggregate traffic before processing it when facing coordinated attacks and improves detection efficiency. In addition, each UPF interface connects to important network element nodes, for example the N3 interface to the 5G base station, the N4 interface to the 5GC, and the N6 interface to the DN, so traffic collected at these interfaces can capture malicious traffic aimed at several different network element nodes.

The invention collects 1,426,457 labeled flows at the UPF network element interfaces, covering 4 types of data: normal traffic, 5G DoS Hulk Attack, 5G DoS Slowhttp Attack, and 5G DDoS Bonesi Attack traffic. A sliding window divides each type of data into data subsets; the window size is set to 50, that is, each data subset contains all data flows within 50 seconds. For each time window, 28 features are computed in time order, as shown in table 3, finally giving the feature set of all the processed data.

Table 3 dataset characteristics

Analysis of abnormality detection results

The invention first analyzes the GYAC-based feature selection, including which important features are selected and how the feature dimension varies with the thresholds. The GYAC-LSTM model is then used to detect and classify 5G network traffic, and the relationship between detection performance and training cost under different feature dimensions is studied. Finally, the Tad-GAN and DAGMM models are compared with the detection model provided by the invention.

The performance indexes used in the invention are Accuracy (Acc), Precision, Recall, their harmonic mean (F1 score), and the Performance-Cost (PC) comprehensive benefit. The PC comprehensive benefit is the ratio of Acc to training time (Train-Time): the larger the value, the higher the combined benefit of Acc and Train-Time cost, and the smaller the value, the lower the benefit. The formulas for Acc and the PC comprehensive benefit are shown in formula (18) and formula (19).
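These indexes can be sketched for the binary case (normal versus abnormal) as follows. Acc, Precision, Recall, and F1 use their standard confusion-matrix definitions, and the PC comprehensive benefit is computed as Acc divided by training time, as stated above. The confusion-matrix counts and the 100 s training time are made-up illustration values, not experimental results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (binary case)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}


def pc_benefit(acc, train_time_seconds):
    """Performance-Cost comprehensive benefit: Acc per unit of training time."""
    return acc / train_time_seconds


# Usage with hypothetical counts and a hypothetical 100 s training time.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
pc = pc_benefit(metrics["acc"], train_time_seconds=100.0)
print(metrics, pc)
```

For the four-class setting of the experiments, the same quantities would be computed per class and then averaged; the averaging scheme is not specified in the text.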

Using the GYAC-based feature selection algorithm, the invention evaluates and ranks the importance coefficients according to the contribution of each feature in the four data types of the total data set: normal data, 5G DoS Hulk Attack data, 5G DoS Slowhttp Attack data, and 5G DDoS Bonesi Attack data. Figs. 19 to 22 show, for each data type, the 10 features with the highest importance coefficients under the proposed algorithm, together with their importance scores.

FIG. 23 shows the curves of the feature dimension reduction rate against the importance threshold m_2 under different redundancy thresholds m_1, with m_1 taking the values 0.25, 0.5, 0.75, and 1.0. With the proposed feature selection algorithm, the feature dimension reduction rate reaches about 0.85 at most. As the importance threshold m_2 increases, the feature dimension reduction rate falls and the dimension reduction capability weakens, until the rate finally drops to 0, meaning that no dimension reduction is performed on the data. This is because, after feature selection by the redundancy threshold m_1, a larger m_2 allows all features to be retained, so that no dimension reduction results.

For the same importance threshold m_2, the feature dimension reduction rate rises as the redundancy threshold m_1 increases, and the dimension reduction capability improves. This is because the sample space of redundant features grows with m_1. For the same importance threshold m_2, more important features can then be selected and the probability of selecting the most important features is greater, so that more identical features are eliminated as duplicates when the union is finally taken.

In addition, it can be observed from the figure that the decay rate of the feature dimension reduction rate decreases as m_1 increases. The reason is similar to the above: when the sample space of redundant features is small, it is harder to select features of higher importance, and the overlap among the important features is low when the union is taken, resulting in poorer dimension reduction capability.

To verify whether the target feature subset after dimension reduction can effectively detect and classify normal traffic and the various kinds of abnormal traffic, the invention constrains different m_1 and m_2 thresholds so that the target feature dimension is 0.25, 0.50, and 0.75 of the total feature set, obtaining several target feature subsets at the corresponding feature dimension ratios. Table 4 gives the value of m_2 for each feature dimension ratio when m_1 is 0.25, 0.50, 0.75, and 1.00, where an empty value indicates that the set target feature dimension cannot be reached under the current conditions.

TABLE 4 LSTM model parameter settings

The data sets at the different feature dimension ratios described above, together with the full feature set, are used to train and test the LSTM classification model, whose network parameters are shown in table 5.

TABLE 5 LSTM model parameter settings

Fig. 24 shows how the detection performance and training time vary with the feature dimension. The abscissa is the ratio of the feature dimension of the data set used for training and testing to the total feature dimension, and the ordinate is the detection performance and training time at the corresponding feature dimension. In the overall trend, the Acc and F1 curves of the model's classification rise as the feature dimension increases. Accuracy and F1 can be kept at about 90% when the feature dimension ratio is only 0.25; when the ratio reaches 0.75, Acc and F1 both exceed 98%, close to the detection performance of the full feature set, and normal and abnormal traffic can essentially be classified correctly. Meanwhile, the training time of the detection model grows with the feature dimension; compared with training without feature selection, the training time can be reduced by about 15% at most.

To examine the relationship between detection performance and model training time cost, the performance-cost benefit of anomaly detection at different feature dimensions is analyzed. Fig. 25 shows how the performance share and the performance-cost benefit vary with the feature dimension. To put the model's detection performance and training time on a common scale, the invention converts the Acc and training time at each feature dimension into shares, that is, the ratio of the Acc at the current feature dimension to the sum of all Acc values and the ratio of the training time at the current feature dimension to the sum of all training times, and normalizes the PC comprehensive benefit index. The abscissa of the figure is therefore the ratio of the feature dimension of the data set used for training and testing to the total feature dimension, and the ordinate is the Acc share and training time share at the current feature dimension, together with the performance-cost benefit.

As the feature dimension ratio increases, the Performance-Performance curve and the Time-Performance curve of the model both rise gradually: the detection effect improves, but the time required also grows. The Performance-Cost curve shows the PC comprehensive benefit index of Acc and training time under different feature dimensions. The benefit is greatest when the feature ratio is 0.25 and 0.5, and falls by about 20% when the feature ratio is 0.75, yet it remains far greater than the comprehensive benefit obtained without the feature selection algorithm, which verifies the advantage of the detection model provided by the invention.
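The share-based scaling described above can be sketched as follows. This is a minimal illustration only: the text does not give the exact form of the PC comprehensive benefit index here, so the sketch assumes it is the ratio of the Acc share to the training time share, normalized by its maximum. The function names `share` and `pc_benefit` and the sample numbers are hypothetical.

```python
def share(values):
    """Convert each value into its fraction of the sum of all values."""
    total = sum(values)
    return [v / total for v in values]


def pc_benefit(accs, times):
    """Assumed PC index: Acc share divided by training time share,
    then normalized so that the best setting has benefit 1.0."""
    acc_share = share(accs)
    time_share = share(times)
    raw = [a / t for a, t in zip(acc_share, time_share)]
    peak = max(raw)
    return [r / peak for r in raw]


# Hypothetical Acc values and training times (seconds) for feature
# dimension ratios 0.25, 0.5, 0.75 and 1.0; not taken from the patent.
example_accs = [0.90, 0.95, 0.98, 0.99]
example_times = [10.0, 12.0, 16.0, 20.0]
example_pc = pc_benefit(example_accs, example_times)
```

Because both quantities are expressed as shares before being combined, the index does not depend on the units in which training time is measured.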

The invention also compares two classical abnormal traffic detection models, Tad-GAN and DAGMM, with the model of the invention. Tad-GAN is a generative model comprising two GAN networks; it can decompose a time series into a smooth base signal and a noise signal, and is often used as a baseline for time series anomaly detection. DAGMM is an anomaly detection model that combines a deep autoencoder with a Gaussian mixture model, and is well suited to high-dimensional data with complex distributions.

FIG. 25 shows the detection performance and PC comprehensive benefit of the different models. The abscissa lists the detection models, namely the GYAC-LSTM (0.50) model with a feature dimension ratio of 0.50, the LSTM model without feature selection, and the Tad-GAN and DAGMM models; the ordinate shows the detection performance and training time cost of each model. For ease of comparison, the invention normalizes the PC comprehensive benefit index.

In terms of Acc, the proposed GYAC-LSTM (0.50) is slightly lower than the other models, at about 95%, because some important features are removed by feature selection, which lowers the overall Acc. In terms of F1, the F1 values of the proposed GYAC-LSTM and LSTM models deviate little from their Acc values: because training uses both normal and abnormal traffic, the Precision and Recall of these models are relatively close. The F1 values of the Tad-GAN and DAGMM models are lower, indicating a lower detection rate for abnormal traffic or a higher false detection rate for normal traffic. This is mainly because these models judge traffic that exceeds a normal threshold to be abnormal, and this hard threshold leads to an imbalance between the Precision and Recall indexes. However, training these models requires only normal traffic data, which reduces the dependence on traffic type labels.
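The link between a Precision/Recall imbalance and a low F1 value can be illustrated with the standard definitions of these metrics. The sketch below is not taken from the patent; the function name `precision_recall_f1` and the label convention (1 for abnormal traffic, 0 for normal traffic) are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard Precision, Recall and F1 for one positive class
    (assumed here to be the 'abnormal traffic' label)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A detector with an overly strict threshold flags few samples:
# Precision is high, Recall is low, and F1 is pulled down.
strict = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0])
```

Since F1 is the harmonic mean of Precision and Recall, it stays close to the smaller of the two, which is why a hard threshold that favors one metric lowers F1 even when the other metric is high.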

In terms of the PC index, because the feature selection algorithm reduces the feature dimension, GYAC-LSTM (0.50) has a higher PC comprehensive benefit, and saves 8% to 44% of the training time cost compared with similar algorithms. The Tad-GAN and DAGMM models combine several network structures and are therefore more complex, so their training time is longer and their PC indexes are lower.

The invention provides a GYAC-LSTM-based 5G network traffic anomaly detection method. The method first preprocesses the original data by conditional filtering, then obtains the feature values in the data by statistical methods and generates a candidate feature set, then screens important features and removes redundant features using the GYAC feature selection algorithm, which reduces the feature dimension, and finally uses the newly generated target feature subset as input for LSTM model training. Experimental results show that the method greatly reduces the feature dimension while keeping the accuracy of identifying abnormal traffic at 93% to 98%. Compared with similar algorithms, it saves 8% to 44% of the training time cost.
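The statistical feature extraction that precedes feature selection can be sketched as follows: packets are sorted by time, grouped into fixed intervals of t seconds, and each interval is summarized by a K-dimensional feature vector. This is a minimal sketch under stated assumptions. The packet representation `(timestamp, size)`, the function name `window_features`, and the three features (packet count, total bytes, mean packet size, so K = 3) are hypothetical choices, not the feature set used by the invention.

```python
from statistics import mean


def window_features(packets, interval):
    """Group (timestamp_seconds, size_bytes) packets into consecutive
    windows of `interval` seconds and return one feature row per window:
    [packet count, total bytes, mean packet size]."""
    if not packets:
        return []
    ordered = sorted(packets)          # sort by timestamp
    start = ordered[0][0]
    buckets = {}
    for ts, size in ordered:
        idx = int((ts - start) // interval)
        buckets.setdefault(idx, []).append(size)
    rows = []
    for i in range(max(buckets) + 1):  # keep empty windows as zero rows
        sizes = buckets.get(i, [])
        rows.append([len(sizes), sum(sizes), mean(sizes) if sizes else 0.0])
    return rows


# Three hypothetical packets, summarized in 1-second windows.
example_rows = window_features([(0.2, 100), (0.9, 300), (2.5, 50)], 1.0)
```

Stacking the rows of all windows gives a T by K matrix for one traffic type, which corresponds in shape to the per-type data set that the LSTM later consumes one time step at a time.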

The following is a system embodiment corresponding to the above method embodiment, and it may be implemented in cooperation with the above embodiment. The related technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the above embodiment.

The GYAC-LSTM-based 5G network traffic anomaly detection system comprises:

For a three-dimensional dataset $D_{N \times K \times T}$, where N denotes the number of data types, K the feature dimension, and T the data length, each type of traffic data is expressed as a feature set. The uncertainty of a feature is measured by the Gini index, as shown in formula (7):
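As a point of reference, the standard Gini index of a discrete feature is one minus the sum of the squared probabilities of its values. The sketch below assumes this standard definition, which may differ in notation or detail from formula (7); the function name `gini_index` is hypothetical, and continuous features would first need to be discretized.

```python
from collections import Counter


def gini_index(values):
    """Standard Gini index of a discrete feature: 1 - sum(p_i ** 2),
    where p_i is the empirical probability of the i-th distinct value."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# A constant feature carries no uncertainty; a feature split evenly
# between two values reaches the two-value maximum.
constant_case = gini_index([1, 1, 1, 1])
balanced_case = gini_index([0, 1, 0, 1])
```

Under this definition the index lies between 0 and 1, and a larger value indicates a more uncertain feature.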


where the weight takes a value between 0 and 1; contribution coefficient one and contribution coefficient two are given different weights and summed to give the feature's total contribution to distinguishing the two types of traffic data, as shown in formula (11):

By calculating the feature's contribution degrees between one type of traffic data and each of the other types and summing them, the importance coefficient of the feature is obtained, as shown in formula (12):


Claims

1. A GYAC-LSTM-based 5G network traffic anomaly detection method, characterized by comprising the following steps:

a data preprocessing step: acquiring 5G network traffic data labeled with anomaly information, and performing conditional filtering on the 5G network traffic data to obtain original training data;

a feature selection step: performing network packet analysis on the original training data to obtain the feature values in the data and generating a respective candidate feature set for each, and performing GYAC-based feature selection on each candidate feature set to remove network traffic data from the candidate feature set, so as to obtain target training data;

a training step: inputting the target training data into an LSTM model, performing 5G network traffic anomaly detection, and constructing a loss function based on the detection result and the anomaly information labeled in the target training data to train the LSTM model, so as to obtain a traffic anomaly detection model.

2. The 5G network traffic anomaly detection method based on GYAC-LSTM as claimed in claim 1, wherein the data preprocessing step includes:

first cleaning the data with a filtering method to remove irrelevant data; then sorting the cleaned traffic data in time order, setting the maximum length as Z and the time interval as t, and dividing all data flows within each t seconds into a data flow subset $flow_t$; computing the K-dimensional feature set $F_{K,t}$ of $flow_t$ by statistical methods, where $F_{K,t} = \{f_{1,t}, f_{2,t}, \ldots, f_{K,t}\}$; and arranging the feature vectors of all time intervals t to finally obtain the data set $D_{K,T}$ of one class of traffic, where $D_{K,T} = \{F_{K,1}, F_{K,2}, \ldots, F_{K,T}\}$.

3. The GYAC-LSTM-based 5G network traffic anomaly detection method as claimed in claim 2, wherein the feature selection step includes: using the difference in Gini index and the comparably reduced index to jointly represent a feature's contribution to distinguishing two types of traffic data, and calculating, summing, and ranking the contributions across the different types of traffic data in turn, to obtain the importance ranking of the traffic features; using cosine similarity to express the correlation between different features within the same type of traffic data, and converting the correlation into a distance between features; selecting, according to a redundancy threshold and an importance threshold, the features ranked highest by correlation coefficient and importance coefficient; and finally taking the union of the features selected for each type of traffic to obtain the target training data.

4. The GYAC-LSTM based 5G network traffic anomaly detection method of claim 3, wherein the feature selection step comprises:

where p denotes the probability of each of the different values taken by the feature, and the Gini index of a feature takes a value between 0 and 1; let $n_1, n_2 \in N$; the difference in Gini index is used to represent contribution coefficient one of the feature to classifying the traffic data of types $n_1$ and $n_2$, as shown in formula (8):

5. The GYAC-LSTM based 5G network traffic anomaly detection method of claim 1, wherein the training step comprises:

where the input is the raw feature value; the minimum and maximum are those of the feature set; $V_{max}$ and $V_{min}$ denote the maximum and minimum of the mapping interval, respectively; and the output is the scaled, normalized result. Let the total number of time steps of the data be T, with each step containing K-dimensional features; the input is denoted $Input = \{X_{1,K}, X_{2,K}, \ldots, X_{T,K}\}$. Each training pass comprises T time steps, each time step corresponds to one LSTM unit, and each unit processes K-dimensional feature data; finally, the prediction result is fed into a fully connected layer. The output dimension of the fully connected layer is the number of traffic data types; assuming this number is N, the output is $Output = \{Y_1, Y_2, \ldots, Y_N\}$. To make the output of the fully connected layer correspond to the probability of predicting each category, a softmax function maps the outputs into the range 0 to 1 and ensures that they sum to 1, so that they serve as the classification probabilities for the current input. Let $y = [y_1, y_2, \ldots, y_N]$, where $y_i$ is the output probability after the softmax function, as shown in formula (17);

6. A GYAC-LSTM-based 5G network traffic anomaly detection system, characterized by comprising:

7. The GYAC-LSTM based 5G network traffic anomaly detection system of claim 6, wherein the data preprocessing module comprises:

8. The GYAC-LSTM based 5G network traffic anomaly detection system of claim 7, wherein the feature selection module comprises: using the difference in Gini index and the comparably reduced index to jointly represent a feature's contribution to distinguishing two types of traffic data, and calculating, summing, and ranking the contributions across the different types of traffic data in turn, to obtain the importance ranking of the traffic features; using cosine similarity to express the correlation between different features within the same type of traffic data, and converting the correlation into a distance between features; selecting, according to a redundancy threshold and an importance threshold, the features ranked highest by correlation coefficient and importance coefficient; and finally taking the union of the features selected for each type of traffic to obtain the target training data.

9. The GYAC-LSTM based 5G network traffic anomaly detection system of claim 8, wherein the feature selection module comprises:

for a three-dimensional dataset $D_{N \times K \times T}$, where N denotes the number of data types, K the feature dimension, and T the data length, each type of traffic data is expressed as a feature set; the uncertainty of a feature is measured by the Gini index, as shown in formula (7):

calculating in turn all distances between the feature and the other features; performing important-feature selection according to the importance coefficient and the distance coefficients of the feature: according to a preset redundancy threshold $m_1$, retaining the proportion of features that are farthest from the feature; according to a preset importance threshold $m_2$, after the redundant features have been removed according to $m_1$, retaining the proportion of features ranked highest by importance coefficient, to obtain the feature subset of that type of traffic data; and taking the union of the feature subsets of all types of traffic data to obtain the target training data of $D_{N \times K \times T}$.

10. The GYAC-LSTM based 5G network traffic anomaly detection system of claim 6, wherein the training module comprises: