CN117318980A - Small sample scene-oriented self-supervision learning malicious traffic detection method - Google Patents


Info

Publication number
CN117318980A
Authority
CN
China
Prior art keywords
flow
malicious
training
feature
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310910097.9A
Other languages
Chinese (zh)
Inventor
沈蒙
叶珂
贾冀哲
王伟
岳光纯
张大伟
吴金贺
欧嵬
祝烈煌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Publication of CN117318980A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a malicious traffic detection method based on self-supervised learning under small-sample conditions, and belongs to the technical field of encrypted network traffic classification. By analyzing the traffic interaction process, the invention finds that three features of a flow, namely packet length, packet protocol and packet inter-arrival time, effectively distinguish different types of traffic. Feature embedding is performed with a continuous bag-of-words (CBOW) model to construct a traffic representation matrix. The matrix is fed to a self-supervised learning model that learns features from unlabeled traffic samples, yielding a traffic feature encoder network. On this basis, a fully connected layer is trained with a small number of labeled traffic samples, and the encoder is connected to the fully connected layer to obtain the malicious traffic detection model. Because training uses only a small amount of labeled data, the method addresses the problem that labeled malicious traffic samples are scarce while supervised models require many samples, and achieves malicious traffic detection under small-sample conditions.

Description

Small sample scene-oriented self-supervision learning malicious traffic detection method
Technical Field
The invention relates to a malicious traffic detection method based on self-supervised learning under small-sample conditions, and belongs to the technical field of encrypted network traffic classification.
Background
With the rapid development of the Internet, network topologies have become more complex and the number of devices has grown, and malicious network traffic is increasing explosively. Conventional malicious traffic detection methods typically scan packet contents for predetermined fixed string patterns, that is, signatures unique to malicious traffic that are unlikely to appear in any benign traffic. However, with the adoption of encryption protocols (e.g., SSL/TLS), payload-based anomaly detection is becoming less effective. A malicious traffic detection method suitable for encrypted traffic is therefore needed.
To detect encrypted malicious traffic, most current methods are built on supervised classification models: malicious traffic detection is treated as a classification task, and a classifier is trained with both normal and malicious traffic as input. This approach requires a large amount of labeled traffic data. Traffic collected in the real world, for example at a gateway, is unlabeled, and labeling it manually is difficult. In addition, building a labeled malicious traffic dataset requires an isolated cyber range environment and a long traffic collection process, which makes it hard to deploy a malicious traffic detection model online quickly. To detect malicious traffic effectively, a malicious traffic detection method based on self-supervised learning under small-sample conditions is therefore needed.
Disclosure of Invention
The invention addresses the problem that labeled malicious traffic samples are scarce while supervised learning models require a large number of samples. Its main purpose is to provide a malicious traffic detection method based on self-supervised learning under small-sample conditions, which detects malicious traffic in encrypted networks using a large amount of unlabeled traffic data and a small amount of labeled traffic data. The method extracts the attributes of each packet and uses feature embedding to construct the feature vector of each flow. It trains a feature encoder with unlabeled data only, freezes the encoder network parameters, trains a fully connected layer with a small amount of labeled data, and connects the encoder to the fully connected layer to build the malicious traffic detection model. Malicious traffic can then be blocked in time, protecting device resources from attack.
The aim of the invention is achieved by the following technical scheme.
The invention discloses a malicious traffic detection method based on self-supervised learning under small-sample conditions. By analyzing the traffic interaction process, it finds that three features of a flow, namely packet length, packet protocol and packet inter-arrival time, effectively distinguish different types of traffic. Feature embedding is performed with a continuous bag-of-words (CBOW) model to construct a traffic representation matrix. The matrix is fed to a self-supervised learning model that learns features from unlabeled traffic samples, yielding a traffic feature encoder network. On this basis, a fully connected layer is trained with a small number of labeled traffic samples, and the encoder is connected to the fully connected layer to obtain the malicious traffic detection model. Because training uses only a small amount of labeled data, the method addresses the problem that labeled malicious traffic samples are scarce while supervised models require many samples, and achieves malicious traffic detection under small-sample conditions.
The malicious traffic detection method based on self-supervised learning under small-sample conditions comprises the following steps:
Step 1: For each flow, extract the packet length, packet protocol and packet inter-arrival time of the first specified number of packets, construct a feature matrix, and store the flow sample for the traffic feature embedding in step 2.
For a flow $T = \{p_1, p_2, \ldots, p_n\}$ containing $n$ packets $p_i$ ($i \le n$), extract the packet length ($l_i$), packet protocol ($q_i$) and packet inter-arrival time ($t_i$) of the first $B$ ($B \le n$) packets, construct a $3 \times B$ traffic feature matrix, and store the flow sample for the traffic feature embedding in step 2.
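The construction of the 3 x B matrix in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Packet` tuple layout, the function name `flow_feature_matrix`, and the choice of 0 for the first packet's interval are assumptions.

```python
from typing import List, Tuple

# Hypothetical packet record: (length in bytes, protocol id, arrival timestamp in seconds).
Packet = Tuple[int, int, float]

def flow_feature_matrix(packets: List[Packet], B: int) -> List[List[float]]:
    """Build the 3 x B matrix [lengths; protocols; inter-arrival times] from the first B packets."""
    if B > len(packets):
        raise ValueError("B must not exceed the number of packets n")
    head = packets[:B]
    lengths = [float(p[0]) for p in head]
    protocols = [float(p[1]) for p in head]
    # Assumption: the first packet has no predecessor, so its interval is set to 0.
    intervals = [0.0] + [head[i][2] - head[i - 1][2] for i in range(1, B)]
    return [lengths, protocols, intervals]

# Example flow with four packets; only the first B = 3 are used.
flow = [(60, 2, 0.00), (1500, 2, 0.25), (52, 2, 0.75), (517, 5, 1.00)]
matrix = flow_feature_matrix(flow, B=3)
```

Each row of `matrix` holds one attribute across the first B packets, which is the layout the embedding in step 2 consumes.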
Step 2: The length, protocol and inter-arrival time of an isolated packet lack contextual semantic information and cannot accurately express the characteristics of a flow. A CBOW model is therefore used to embed each packet attribute, using a specified number of packets before and after each packet as context. Through this embedding, the packet length, protocol and inter-arrival time are each expanded to a specified dimension, enriching the traffic features. The CBOW model consists of an input layer, a hidden layer and an output layer, and training it yields a weight matrix that is used for feature embedding. The distinct attribute values of all packets are counted to construct an attribute value dictionary. The input layer of CBOW is the one-hot encoded representation of multiple features; the hidden layer and output layer are obtained by matrix computation; and the output layer gives the probability of each predicted packet attribute value. The attribute value with the maximum probability is taken as the output, its error against the true value is computed, and the network parameters are updated by back propagation. After multiple iterations, the network parameter matrix serves as the feature vector matrix for feature embedding: each row corresponds to one attribute value, and features are expanded into vectors of the specified dimension according to this matrix.
Following these steps, CBOW models for the packet length, protocol and inter-arrival time are trained in turn, and the corresponding feature vector matrices are obtained. Each attribute value is matched to its feature vector, and attribute values that cannot be matched are replaced by the zero vector. This realizes traffic feature embedding: one-dimensional features are expanded to high dimensions, contextual semantic information is fused in, the traffic features are enriched, and the traffic representation is obtained.
The distinct attribute values of all packets are counted, and an attribute value dictionary $D = \{S_1, S_2, \ldots, S_V\}$ is constructed, where $V$ is the number of distinct attribute values. The input layer is the one-hot encoded representation of multiple features $\{x_1, x_2, \ldots, x_C\}$, where $x_i$ ($i \le C$) is the one-hot encoding of the $i$-th feature. The hidden layer is computed as:
$$h = \frac{1}{C} W^{T} \sum_{i=1}^{C} x_i$$
where $W$ is a $V \times N$ matrix and $N$ is the dimension of the hidden layer. The hidden layer $h$ is passed through the matrix $W'$ to obtain the output layer $o$:
$$o = W'^{T} h$$
where $W'$ is an $N \times V$ matrix. According to the output layer, the probability that the packet attribute value is predicted as $S_k$ is:
$$P(S_k \mid x_1, \ldots, x_C) = \frac{\exp(o_k)}{\sum_{j=1}^{V} \exp(o_j)}$$
The attribute value with the maximum probability is taken as the output, its error against the true value is computed, and the matrices $W$ and $W'$ are updated by back propagation. After multiple iterations, $W$ is the feature vector matrix, and its $k$-th row $W_k$ is the feature vector of attribute value $S_k$.
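The CBOW forward pass described above can be sketched in plain Python. This is an illustrative sketch under assumptions: the hidden layer averages the context rows of W (the standard CBOW form), and the function name `cbow_forward` and the toy weights are hypothetical.

```python
import math
from typing import List

def cbow_forward(context_ids: List[int], W: List[List[float]],
                 W_out: List[List[float]]) -> List[float]:
    """Return P(S_k | context) for every dictionary entry k.

    W is V x N (input weights); W_out is N x V (the matrix W' in the text).
    """
    V, N = len(W), len(W[0])
    C = len(context_ids)
    # A one-hot input selects one row of W, so h is the mean of the selected rows
    # (averaging over the C context items is an assumption).
    h = [sum(W[i][n] for i in context_ids) / C for n in range(N)]
    # Output scores: o_k = sum_n W'[n][k] * h[n], i.e. o = W'^T h.
    o = [sum(W_out[n][k] * h[n] for n in range(N)) for k in range(V)]
    # Numerically stable softmax over the V dictionary entries.
    m = max(o)
    exps = [math.exp(x - m) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

# Toy dictionary with V = 3 attribute values and hidden dimension N = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
probs = cbow_forward([0, 1], W, W_out)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

After training, only `W` is kept: row k of `W` is the embedding of dictionary entry k.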
Following these steps, the feature vectors of the packet length, packet protocol and packet inter-arrival time are computed in turn. Each attribute value is matched to its feature vector, and attribute values that cannot be matched are replaced by the zero vector. This realizes traffic feature embedding: one-dimensional features are expanded to high dimensions, contextual semantic information is fused in, the traffic features are enriched, and the traffic representation is obtained.
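The lookup with a zero-vector fallback can be sketched as below. The dictionary-based table and the name `embed_flow` are assumptions made for illustration; the source only states that unmatched attribute values are replaced by a 0 vector.

```python
from typing import Dict, List

def embed_flow(values: List[int], table: Dict[int, List[float]], dim: int) -> List[List[float]]:
    """Map each attribute value to its learned vector; unseen values map to the zero vector."""
    zero = [0.0] * dim
    return [list(table.get(v, zero)) for v in values]

# Hypothetical learned embeddings for two packet lengths (dimension 2 for brevity).
length_table = {60: [0.1, 0.2], 1500: [0.3, 0.4]}
embedded = embed_flow([60, 999, 1500], length_table, dim=2)
```

Here the length 999 was never seen during CBOW training, so its row is all zeros.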
Step 3: Traffic sample data enhancement. A malicious attacker may use traffic obfuscation to make traffic harder to detect. To account for this, part of the information in the original flow is destroyed by adding random packets and random delays to it. The resulting samples are used for the encoder network training in step 4, where the correlation between the original flow and the obfuscated flow is learned from the retained features, so that malicious traffic concealed by obfuscation can be detected.
For each packet $p_i$ in a flow $T$, a random packet of length $l'_i$ ($0 \le l'_i \le 1500$) is added after it with a preset probability, and a preset proportion of the packets is selected by simple random sampling and given a random delay $\Delta t_i$ ($0 \le \Delta t_i \le 0.2$ s), so that their inter-arrival time becomes $t_i + \Delta t_i$. The features of the obfuscated flow are extracted, and the data-enhanced traffic sample is obtained through feature embedding. The data-enhanced sample is used in the encoder network training in step 4, where the correlation between the original flow and the obfuscated flow is learned from the retained features, so that malicious traffic concealed by obfuscation can be detected.
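A minimal sketch of this obfuscation step follows. The packet tuple layout, the function name `obfuscate`, and the protocol and interval assigned to inserted packets are assumptions; the source specifies only the length range and the delay range.

```python
import random
from typing import List, Tuple

# Hypothetical packet record: (length in bytes, protocol id, inter-arrival time in seconds).
Packet = Tuple[int, int, float]

def obfuscate(flow: List[Packet], insert_prob: float, delay_ratio: float,
              rng: random.Random) -> List[Packet]:
    """Return an obfuscated copy of `flow` with dummy packets and random delays."""
    # Simple random sampling: pick a fixed proportion of packet indices to delay.
    k = int(len(flow) * delay_ratio)
    delayed = set(rng.sample(range(len(flow)), k))
    out: List[Packet] = []
    for i, (length, proto, gap) in enumerate(flow):
        if i in delayed:
            gap += rng.uniform(0.0, 0.2)  # 0 <= delta_t <= 0.2 s
        out.append((length, proto, gap))
        if rng.random() < insert_prob:
            # Dummy packet with 0 <= l' <= 1500. Reusing the protocol and a zero
            # interval are assumptions; the source does not specify them.
            out.append((rng.randint(0, 1500), proto, 0.0))
    return out

flow = [(60, 2, 0.0), (1500, 2, 0.25), (52, 2, 0.5), (517, 5, 0.25)]
augmented = obfuscate(flow, insert_prob=0.1, delay_ratio=0.5, rng=random.Random(0))
```

The original flow and `augmented` would then both pass through feature extraction and embedding to form the pair used in step 4.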
Step 4: Divide the whole dataset into a training set and a test set according to a preset ratio, and use all training set data, without labels, to train the traffic feature encoder network. A convolutional neural network extracts deep vector representations from the original feature representation and the data-enhanced feature representation of each flow. The similarity of all deep vector representations is computed, a loss function is constructed from the similarity, and the encoder network parameters are computed iteratively by back propagation to obtain the traffic feature encoder network. Pre-training thus uses only unlabeled traffic samples, which improves the efficiency of building the malicious traffic detection model.
The whole dataset is divided into a training set and a test set according to a preset ratio, and all training set data, without labels, are used to train the self-supervised learning model. For each flow $T_i$, a convolutional neural network extracts the deep vector representations $s_i$ and $s'_i$ from its original and data-enhanced feature representations. The similarity of all deep vector representations is computed, where the similarity function is defined as:
[equation missing from the source text]
where $u$ and $v$ are the deep vector representations of two samples extracted by the convolutional neural network. The loss function for any one traffic sample is defined as:
[equation missing from the source text]
where $Q$ is the total number of samples. The convolutional neural network updates its parameters iteratively according to the loss and thereby learns how to extract features from traffic samples automatically. The encoder network parameters are computed by back propagation of the loss function, finally yielding the traffic feature encoder network. Pre-training thus uses only unlabeled traffic samples, which improves the efficiency of building the malicious traffic detection model.
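The source text does not preserve the similarity and loss formulas. As one plausible instance of a similarity-based contrastive objective, the sketch below uses cosine similarity and the NT-Xent loss from SimCLR-style training. Both choices, the temperature `tau`, and the function names are assumptions, not the patent's stated definitions.

```python
import math
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two deep vector representations (assumed similarity function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(originals: List[List[float]], augmented: List[List[float]],
                     tau: float = 0.5) -> float:
    """Mean NT-Xent loss; originals[i] and augmented[i] form the positive pair for flow i."""
    views = originals + augmented
    Q = len(originals)
    total = 0.0
    for i in range(2 * Q):
        pos = (i + Q) % (2 * Q)  # index of the other view of the same flow
        denom = sum(math.exp(cosine(views[i], views[k]) / tau)
                    for k in range(2 * Q) if k != i)
        total += -math.log(math.exp(cosine(views[i], views[pos]) / tau) / denom)
    return total / (2 * Q)

# Toy encoder outputs for Q = 2 flows and their obfuscated counterparts.
s = [[1.0, 0.0], [0.0, 1.0]]
s_aligned = [[0.9, 0.1], [0.1, 0.9]]
s_swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = contrastive_loss(s, s_aligned)
loss_swapped = contrastive_loss(s, s_swapped)
```

The loss is lower when each flow's two views are close to each other and far from other flows, which is the behavior the encoder is trained toward.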
To improve the effectiveness of the features, the convolutional neural network is preferably ResNet-50.
Step 5: After the feature encoder network has been trained in step 4, connect a fully connected layer to it and keep the encoder network parameters fixed. Randomly select a specified proportion of the samples of each class in the training set and train the fully connected layer, with cross entropy as the loss function. The fully connected layer parameters are computed by back propagation of the loss function, and the trained malicious traffic detection model is obtained once the preset number of iterations is reached.
A fully connected layer is added after the convolutional neural network layers, n% of the samples of each class in the training set are selected at random, and the fully connected layer is trained with the cross entropy loss:
$$L = -\frac{1}{Q} \sum_{i=1}^{Q} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
where $Q$ is the total number of samples, $y_i$ indicates whether flow $T_i$ is malicious traffic ($y_i = 1$) or normal traffic ($y_i = 0$), and $p_i$ is the probability that flow $T_i$ is malicious traffic. The fully connected layer parameters are computed by back propagation of the loss function, and the trained malicious traffic detection model is obtained once the number of iterations is reached.
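Training only the classification head on frozen encoder outputs can be sketched as follows. The sketch assumes a single sigmoid output unit trained by full-batch gradient descent on the binary cross entropy; the function names, learning rate, and toy features are hypothetical stand-ins for real encoder outputs.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability p_i that a flow with frozen-encoder features x is malicious."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce(labels: List[int], probs: List[float]) -> float:
    """Binary cross entropy averaged over Q samples."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(labels, probs)) / len(labels)

def train_head(features: List[List[float]], labels: List[int],
               lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit one fully connected unit; the encoder that produced `features` stays fixed."""
    dim, Q = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(w, b, x) - y  # gradient of BCE with respect to the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / Q for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / Q
    return w, b

# Toy frozen-encoder outputs: two malicious (1) and two normal (0) flows.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ys = [1, 1, 0, 0]
w, b = train_head(feats, ys)
final_loss = bce(ys, [predict(w, b, x) for x in feats])
```

Because the encoder is frozen, only `w` and `b` change, which is why a small labeled set is enough for this stage.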
Step 6: The malicious traffic detection model is obtained by connecting the encoder network trained in step 4 to the fully connected layer trained in step 5. Training this model requires only a small amount of labeled data, and the model realizes malicious traffic detection in small-sample scenarios.
Beneficial effects:
1. The malicious traffic detection method based on self-supervised learning under small-sample conditions disclosed by the invention finds that three features of a flow, namely packet length, packet protocol and packet inter-arrival time, effectively distinguish different types of traffic. The extracted features are combined with a self-supervised learning model to train a traffic feature encoder on unlabeled data. On this basis, a fully connected layer is trained with a small number of labeled traffic samples, and the encoder is connected to the fully connected layer to obtain the malicious traffic detection model. Because training uses only a small amount of labeled data, the method addresses the difficulty of building a model when labels are scarce and realizes malicious traffic detection in small-sample scenarios.
2. The method embeds each packet attribute using a specified number of packets before and after each packet as context. Through this embedding, the packet length, protocol and inter-arrival time are each expanded to a specified dimension, which enriches the traffic features and improves the accuracy of malicious traffic detection.
3. The method obtains data-enhanced obfuscated samples by randomly adding redundant packets and delays. The encoder network learns the correlation between the original flow and the obfuscated flow from the retained features, so that malicious traffic concealed by obfuscation can be detected using a large amount of unlabeled traffic data and a small amount of labeled traffic data.
Drawings
FIG. 1 is a flow chart of the malicious traffic detection method based on self-supervised learning under small-sample conditions.
Detailed Description
For a better description of the objects and advantages of the present invention, the following description of the invention refers to the accompanying drawings and examples.
Example 1:
FIG. 1 is a flow chart of malicious traffic detection. This embodiment discloses a malicious traffic detection method based on self-supervised learning under small-sample conditions, implemented in the following steps:
Step 1: Extract the packet length, packet protocol (IP, TCP, UDP, HTTP, TLS, DNS and others, denoted by the numerals 1-7 respectively) and packet inter-arrival time of the first 10 packets of each flow, construct a 3 x 10 traffic feature matrix, and store the flow samples in csv format.
Step 2: For the length, protocol and inter-arrival time of each packet, compute the embedded feature vector using the 2 packets before and the 2 packets after that packet. The learning rate of the CBOW model is set to 0.001 and the hidden layer dimension to 100, so each feature vector has dimension 1 x 100 and the feature representation of each flow has dimension 30 x 100.
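The context window of 2 packets before and 2 after, and the resulting 30 x 100 shape, can be checked with a short sketch. The function name `context_windows` and the sample packet lengths are hypothetical; packets near the start or end of a flow simply get a shorter context here, which is an assumption.

```python
from typing import List, Tuple

def context_windows(seq: List[int], window: int = 2) -> List[Tuple[List[int], int]]:
    """Pair each attribute value with the values of up to `window` packets before and after it."""
    pairs = []
    for i, target in enumerate(seq):
        context = seq[max(0, i - window):i] + seq[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

# Packet lengths of one hypothetical flow; each pair is one CBOW training example.
lengths = [60, 1500, 52, 517, 40]
pairs = context_windows(lengths, window=2)

# Shape bookkeeping for the embodiment: 3 attributes x 10 packets, each embedded to 100 dims.
num_attributes, num_packets, embed_dim = 3, 10, 100
flow_shape = (num_attributes * num_packets, embed_dim)
```

Each of the 30 rows corresponds to one attribute of one packet, expanded from a scalar to a 100-dimensional vector.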
Step 3: Generate an obfuscated flow for each flow: add a random redundant packet before each packet with 10% probability, add a random delay to each packet with 50% probability, and extract the feature matrix of the obfuscated flow to obtain the data-enhanced sample.
Step 4: The public dataset USTC-TFC2016 is used, which contains 10 types of normal traffic and 10 types of malicious traffic. The data are divided into a training set and a test set at a ratio of 8:2, the labels of the training set samples are removed, and the self-supervised learning model is trained with the unlabeled samples. The model's convolutional neural network is built with reference to ResNet-50 and extracts the traffic features; the batch_size is 20, the learning rate is 0.001, and the number of iterations is 20.
Step 5: Freeze the convolutional neural network parameters and add a fully connected layer network model. Take 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.10%, 0.50% and 1.00% of the samples and labels in the training set in turn (each class of traffic is sampled at random at that proportion) and train the fully connected layer network model; the batch_size is 20, the learning rate is 0.001, and the number of iterations is 20.
Step 6: To demonstrate the effectiveness of the self-supervised learning model, construct a convolutional neural network and fully connected layer network model with the same structure as in step 5, initialize the network parameters randomly, and train the network using only 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.10%, 0.50% and 1.00% of the samples and labels in the training set (each class of traffic is sampled at random at that proportion); the batch_size is 20, the learning rate is 0.001, and the number of iterations is 20.
Step 7: The classifiers obtained in step 5 and step 6 are each evaluated for malicious traffic detection on the test set. The detection results are shown in Table 1 and indicate that the method achieves good detection performance using only a small number of labeled samples.
TABLE 1 Malicious traffic detection results
Proportion of labeled samples   Number of labeled samples   Accuracy of self-supervised learning model   Accuracy using only labeled data
0.01%                           24                          77.27%                                       55.27%
0.02%                           62                          90.73%                                       56.08%
0.03%                           99                          92.44%                                       71.33%
0.04%                           133                         95.25%                                       76.89%
0.05%                           170                         98.42%                                       81.95%
0.10%                           350                         98.47%                                       82.92%
0.50%                           1794                        98.03%                                       95.31%
1.00%                           3596                        98.75%                                       96.48%
* Because the flows of each class are sampled at random in proportion, when the proportion of labeled samples is very small the number of samples drawn for some classes can fall below 1, in which case 0 samples are drawn for that class. Consequently, when the proportion of labeled samples increases, the number of labeled samples does not necessarily increase by the same factor.
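The per-class sampling rule in the note above can be sketched as follows. Rounding down with `math.floor` is an assumption that generalizes the stated rule (a count below 1 becomes 0), and the class names and sizes are hypothetical.

```python
import math
import random
from typing import Dict, List

def sample_labeled(class_to_samples: Dict[str, List[int]], ratio: float,
                   rng: random.Random) -> Dict[str, List[int]]:
    """Draw floor(ratio * class size) labeled samples from each class independently."""
    picked = {}
    for label, samples in class_to_samples.items():
        k = math.floor(len(samples) * ratio)  # a count below 1 becomes 0 samples
        picked[label] = rng.sample(samples, k)
    return picked

# Hypothetical training set: one large class and one small class.
train = {"large_class": list(range(25000)), "small_class": list(range(4000))}
rng = random.Random(0)
at_001 = sample_labeled(train, 0.0001, rng)  # 0.01%
at_002 = sample_labeled(train, 0.0002, rng)  # 0.02%
total_001 = sum(len(v) for v in at_001.values())
total_002 = sum(len(v) for v in at_002.values())
```

With these sizes, doubling the proportion raises the total from 2 to 5 labeled samples rather than to 4, and the small class contributes none at either setting.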
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (7)

1. A malicious traffic detection method based on self-supervised learning under small-sample conditions, characterized by comprising the following steps:
step 1: for each flow, extracting the packet length, packet protocol and packet inter-arrival time of the first specified number of packets, constructing a feature matrix, and storing the flow sample for the traffic feature embedding in step 2;
step 2: because the length, protocol and inter-arrival time of an isolated packet lack contextual semantic information and cannot accurately express the characteristics of a flow, using a CBOW model to embed each packet attribute with a specified number of packets before and after each packet as context, whereby the packet length, protocol and inter-arrival time are each expanded to a specified dimension and the traffic features are enriched; the CBOW model consists of an input layer, a hidden layer and an output layer, and training the CBOW model yields a weight matrix that is used for feature embedding; counting the distinct attribute values of all packets and constructing an attribute value dictionary, wherein the input layer of CBOW is the one-hot encoded representation of multiple features, the hidden layer and output layer are obtained by matrix computation, the probability of each predicted packet attribute value is obtained from the output layer, the attribute value with the maximum probability is taken as the output, its error against the true value is computed, and the network parameters are updated by back propagation; after multiple iterations, the network parameter matrix serves as the feature vector matrix for feature embedding, wherein each row corresponds to one attribute value and features are expanded into vectors of the specified dimension according to the matrix; following these steps, training in turn the CBOW models for the packet length, protocol and inter-arrival time and obtaining the corresponding feature vector matrices, matching each attribute value to its feature vector, and replacing attribute values that cannot be matched with the zero vector, thereby realizing traffic feature embedding, expanding one-dimensional features to high dimensions, fusing contextual semantic information, enriching the traffic features, and obtaining the traffic representation;
step 3: in view of the probability that a malicious attacker increases the flow concealment in a flow confusion manner, partial information of the original flow is destroyed in a manner of adding a random data packet and increasing random time delay to the original flow, and the retained characteristics learn the correlation between the original flow and the confusion flow;
step 4: all data sets are divided into a training set and a test set according to a preset proportion, and all training set data, without labels, are used to train the flow feature encoder network; a convolutional neural network extracts deep vector representations from the original feature representation and the data-enhanced feature representation of each flow, the similarity between all deep vector representations is calculated, a loss function is constructed from the similarity, and the encoder network parameters are calculated iteratively by back propagation to obtain the flow feature encoder network; model pre-training is realized using only unlabeled flow samples, which improves the efficiency of constructing the malicious traffic detection model;
step 5: a fully connected layer is connected after the feature encoder network trained in step 4, and the encoder network parameters are kept fixed; a specific proportion of the samples of each class is randomly selected from the training set to train the fully connected layer; the loss function is defined as cross entropy, and the fully connected layer network parameters are calculated by back propagation of the loss function; after a preset number of iterations is reached, the trained malicious traffic detection model is obtained;
step 6: the malicious traffic detection model is obtained by connecting the encoder network trained in step 4 with the fully connected layer trained in step 5; training the malicious traffic detection model requires only a small amount of labeled data, and malicious traffic detection in a small-sample scenario is realized through the malicious traffic detection model.
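The data handling in steps 4 and 5 can be illustrated with a minimal sketch, assuming samples are `(flow, label)` pairs. The function names `split_dataset` and `sample_labeled_per_class`, the 80/20 split, and the 10% labeled proportion are hypothetical choices for illustration; the claim only specifies "a preset proportion" and "a specific proportion".

```python
import random
from collections import defaultdict

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle and split into training and test sets (ratio is an assumed value)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def sample_labeled_per_class(train, percent, seed=0):
    """Randomly keep `percent`% of each class, with at least one sample per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in train:
        by_class[label].append((sample, label))
    subset = []
    for items in by_class.values():
        k = max(1, round(len(items) * percent / 100))
        subset.extend(rng.sample(items, k))
    return subset

# Toy data set: label 1 = malicious, label 0 = normal.
data = [(f"flow_{i}", i % 2) for i in range(100)]
train, test = split_dataset(data)
unlabeled = [sample for sample, _ in train]   # encoder pre-training ignores labels
labeled = sample_labeled_per_class(train, percent=10)
```

Here `unlabeled` would feed the encoder pre-training of step 4, while the much smaller `labeled` subset would be used only for the fully connected layer in step 5.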
2. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 1, wherein step 1 is implemented as follows:
for a flow $T = \{p_1, p_2, \ldots, p_n\}$ containing $n$ data packets $p_i$ ($i \le n$), the packet length ($l_i$), packet protocol ($q_i$), and packet arrival time interval ($t_i$) of the first $B$ ($B \le n$) data packets are extracted to construct a $3 \times B$ flow feature matrix, which is stored as a flow sample for the flow feature embedding in step 2.
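The construction of the $3 \times B$ flow feature matrix can be sketched as follows. The packet record layout `(length, protocol, arrival_interval)`, the function name `build_feature_matrix`, and the convention that the first packet has an interval of 0 are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical packet record: (length, protocol, arrival interval in seconds).
Packet = Tuple[int, str, float]

def build_feature_matrix(flow: List[Packet], B: int) -> List[list]:
    """Return a 3 x B matrix: row 0 lengths, row 1 protocols, row 2 intervals."""
    if not 0 < B <= len(flow):
        raise ValueError("B must satisfy 0 < B <= n")
    first = flow[:B]                      # only the first B packets are used
    lengths = [p[0] for p in first]
    protocols = [p[1] for p in first]
    intervals = [p[2] for p in first]
    return [lengths, protocols, intervals]

# Toy flow with n = 4 packets; B = 3 keeps the first three.
flow = [(60, "TCP", 0.0), (1500, "TCP", 0.012), (52, "TLS", 0.25), (40, "TCP", 0.001)]
matrix = build_feature_matrix(flow, B=3)
```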
3. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 2, wherein step 2 is implemented as follows:
because the length, protocol, and arrival time interval of an individual data packet lack contextual semantic information and therefore cannot accurately express flow characteristics, a CBOW model is used to perform feature embedding on the packet attributes, using a specific number of data packets before and after each data packet as context; through the feature embedding operation, the three features of packet length, protocol, and arrival time interval are each expanded to a specific dimension, enriching the flow features; the CBOW model consists of an input layer, a hidden layer, and an output layer, and a weight matrix obtained by training the model is used to realize feature embedding;
the distinct attribute values of all data packets are counted, and an attribute value dictionary $D = \{S_1, S_2, \ldots, S_V\}$ is constructed, where $V$ is the number of distinct attribute values; the input layer is the one-hot encoded representation $\{x_1, x_2, \ldots, x_C\}$ of a plurality of features, where $x_i$ ($i \le C$) is the one-hot encoding of the $i$-th feature, and the hidden layer is calculated as:
$$h = \frac{1}{C} W^{T} (x_1 + x_2 + \cdots + x_C)$$
where $W$ is a $V \times N$ matrix and $N$ is the dimension of the hidden layer; the hidden layer $h$ is passed through the matrix $W'$ to obtain the output layer $o$:
$$o = W'^{T} h$$
where $W'$ is an $N \times V$ matrix; the probability, according to the output layer, that the attribute value of the data packet is predicted to be $S_k$ is:
$$P(S_k \mid x_1, \ldots, x_C) = \frac{\exp(o_k)}{\sum_{j=1}^{V} \exp(o_j)}$$
the attribute value with the maximum probability is taken as the output, the error between it and the true value is calculated, and the matrices $W$ and $W'$ are updated by back propagation; after multiple iterations, the matrix $W$ is the feature vector matrix, and its $k$-th row $W_k$ is the feature vector of attribute value $S_k$;
following the above steps, the feature vectors for the packet length, the packet protocol, and the packet arrival time interval are calculated in turn; each attribute value is matched to its corresponding feature vector, and any attribute value that cannot be matched is replaced by a zero vector; flow feature embedding is thereby achieved, one-dimensional features are expanded to high dimensions, contextual semantic information is fused, the flow features are enriched, and the flow representation is obtained.
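The CBOW forward pass and the embedding lookup with a zero-vector fallback can be sketched in plain Python. This is an illustration under assumptions, not the patented implementation: it uses the standard CBOW formulation (hidden layer as the average of the context rows of $W$, softmax over the output layer), the weights are random and untrained, and the names `cbow_forward`, `embed`, and the toy packet-length dictionary are hypothetical. The back-propagation update is omitted.

```python
import math
import random

def one_hot(index, V):
    return [1.0 if j == index else 0.0 for j in range(V)]

def cbow_forward(context_ids, W, W_out):
    """W is V x N, W_out is N x V. Returns (h, o, probs)."""
    V, N = len(W), len(W[0])
    C = len(context_ids)
    # Sum of the C one-hot context vectors.
    x_sum = [0.0] * V
    for idx in context_ids:
        for j, value in enumerate(one_hot(idx, V)):
            x_sum[j] += value
    # Hidden layer: h = (1/C) * W^T * (x_1 + ... + x_C).
    h = [sum(W[v][n] * x_sum[v] for v in range(V)) / C for n in range(N)]
    # Output layer: o = W'^T * h.
    o = [sum(W_out[n][k] * h[n] for n in range(N)) for k in range(V)]
    # Softmax over the V dictionary entries (shifted by max for stability).
    m = max(o)
    exps = [math.exp(value - m) for value in o]
    total = sum(exps)
    return h, o, [e / total for e in exps]

def embed(values, dictionary, W):
    """Map each attribute value to its row of W; unmatched values get a zero vector."""
    N = len(W[0])
    index = {s: k for k, s in enumerate(dictionary)}
    return [list(W[index[v]]) if v in index else [0.0] * N for v in values]

rng = random.Random(0)
dictionary = [40, 52, 60, 1500]   # hypothetical packet-length dictionary D
V, N = len(dictionary), 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

h, o, probs = cbow_forward([0, 2], W, W_out)       # context: lengths 40 and 60
predicted = dictionary[probs.index(max(probs))]    # attribute with max probability
vectors = embed([60, 9999], dictionary, W)         # 9999 is not in D
```

After training, only `W` is kept: each packet attribute is replaced by its row of `W`, so a $3 \times B$ matrix of scalar attributes becomes a matrix of $N$-dimensional vectors.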
4. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 3, wherein step 3 is implemented as follows:
for each data packet $p_i$ in a flow $T$, a random data packet of length $l'_i$ ($0 \le l'_i \le 1500$) is added after it; at the same time, a predetermined proportion of data packets is selected by simple random sampling and given a random time delay $\Delta t_i$ ($0 \le \Delta t_i \le 0.2$ s), so that their arrival time interval becomes $t_i + \Delta t_i$; the obfuscated flow features are extracted, and a data-enhanced flow sample is obtained through feature embedding; the data-enhanced flow sample is used in the encoder network training of step 4, where the correlation between the original flow and the obfuscated flow is learned from the retained features, enabling detection of malicious traffic whose concealment has been increased through obfuscation.
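The obfuscation-style data enhancement can be sketched as below. Several details are assumptions: the sketch inserts one dummy packet after every original packet, gives the dummy packet the protocol of the preceding packet and an interval of 0, and uses a delay proportion of 0.3. The function name `obfuscate` and its parameters are hypothetical.

```python
import random

def obfuscate(flow, delay_ratio=0.3, max_len=1500, max_delay=0.2, seed=None):
    """flow: list of (length, protocol, interval). Returns the obfuscated flow."""
    rng = random.Random(seed)
    n = len(flow)
    # Simple random sampling (without replacement) of the packets to delay.
    delayed = set(rng.sample(range(n), k=round(delay_ratio * n)))
    out = []
    for i, (length, proto, interval) in enumerate(flow):
        if i in delayed:
            interval += rng.uniform(0.0, max_delay)   # t_i + delta_t_i
        out.append((length, proto, interval))
        # Dummy packet with random length l'_i in [0, max_len] (assumed layout).
        out.append((rng.randint(0, max_len), proto, 0.0))
    return out

original = [(60, "TCP", 0.0), (1500, "TCP", 0.012), (52, "TLS", 0.25), (40, "TCP", 0.001)]
augmented = obfuscate(original, seed=42)
```

The original packets remain at the even positions of `augmented`, so the features retained after obfuscation are what the encoder can use to relate the two views of the same flow.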
5. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 4, wherein step 4 is implemented as follows:
all data sets are divided into a training set and a test set according to a preset proportion, and all training set data, without labels, are used for self-supervised learning model training; for the original feature representation and the data-enhanced feature representation of each flow $T_i$, deep vector representations ($s_i$ and $s'_i$) are extracted using a convolutional neural network; the similarity of all deep vector representations is calculated, where the similarity function is defined as:
where $u$ and $v$ are the deep vector representations of two samples extracted by the convolutional neural network, and the loss function for any one flow sample is defined as:
where $Q$ is the total number of samples; the convolutional neural network iteratively calculates its parameters according to the loss and thereby automatically learns how to extract features from flow samples; the encoder network parameters are calculated iteratively by back propagation of the loss function to finally obtain the flow feature encoder network; model pre-training is realized using only unlabeled flow samples, which improves the efficiency of constructing the malicious traffic detection model.
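One common way to instantiate such a similarity function and per-sample loss is cosine similarity with a normalized temperature-scaled cross-entropy (NT-Xent) loss, as used in contrastive pre-training. The sketch below assumes that form; the temperature parameter and its value of 0.5, and the function names, are assumptions not stated in the claim, and the toy 2-D vectors stand in for convolutional neural network outputs.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(reps, i, j, temperature=0.5):
    """NT-Xent-style loss for anchor reps[i] with positive reps[j].

    reps holds all Q deep vectors (original and data-enhanced views together).
    """
    pos = math.exp(cosine_similarity(reps[i], reps[j]) / temperature)
    # Denominator: every other sample k != i, including the positive.
    denom = sum(
        math.exp(cosine_similarity(reps[i], reps[k]) / temperature)
        for k in range(len(reps)) if k != i
    )
    return -math.log(pos / denom)

# Two flows, each with an original and an enhanced view: s_1, s'_1, s_2, s'_2.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = contrastive_loss(reps, 0, 1)      # s_1 paired with its own view
loss_misaligned = contrastive_loss(reps, 0, 2)   # s_1 paired with another flow
```

The loss is lower when the anchor is paired with the enhanced view of the same flow, which is the signal that pushes the encoder to map a flow and its obfuscated version to nearby vectors.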
6. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 5, wherein step 5 is implemented as follows:
a fully connected layer is added after the convolutional neural network layers, n% of the samples of each class in the training set are randomly selected to train the fully connected layer, and the loss function is the cross-entropy loss:
$$L = -\frac{1}{Q} \sum_{i=1}^{Q} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
where $Q$ is the total number of samples, $y_i$ indicates whether flow $T_i$ is malicious traffic ($y_i = 1$) or normal traffic ($y_i = 0$), and $p_i$ is the probability that flow $T_i$ is malicious traffic; the fully connected layer network parameters are calculated by back propagation of the loss function, and the trained malicious traffic detection model is obtained after the number of iterations is reached.
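Training a single fully connected layer on frozen encoder outputs with binary cross-entropy can be sketched as follows. The sigmoid output, the learning rate, the iteration count, and the toy 2-D "encoder features" are assumptions for illustration; in the claimed method the features would come from the pre-trained convolutional encoder, whose parameters stay fixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p, eps=1e-12):
    """Mean binary cross-entropy over Q samples."""
    Q = len(y)
    return -sum(
        yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
        for yi, pi in zip(y, p)
    ) / Q

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_head(features, labels, lr=0.5, iterations=200):
    """Gradient descent on one fully connected layer; encoder features are fixed."""
    dim, Q = len(features[0]), len(labels)
    w, b = [0.0] * dim, 0.0
    for _ in range(iterations):
        probs = [predict(w, b, x) for x in features]
        # Gradient of the mean cross-entropy w.r.t. each logit is (p_i - y_i) / Q.
        grad_w = [
            sum((p - y) * x[d] for p, y, x in zip(probs, labels, features)) / Q
            for d in range(dim)
        ]
        grad_b = sum(p - y for p, y in zip(probs, labels)) / Q
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # toy encoder outputs
labels = [1, 1, 0, 0]                                         # 1 = malicious
w, b = train_head(features, labels)
probs = [predict(w, b, x) for x in features]
loss = bce(labels, probs)
```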
7. The method for detecting malicious traffic under small-sample conditions based on self-supervised learning as set forth in claim 6, wherein the convolutional neural network is ResNet-50.
CN202310910097.9A 2023-07-10 2023-07-24 Small sample scene-oriented self-supervision learning malicious traffic detection method Pending CN117318980A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310839360X 2023-07-10
CN202310839360 2023-07-10

Publications (1)

Publication Number Publication Date
CN117318980A true CN117318980A (en) 2023-12-29

Family

ID=89296096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310910097.9A Pending CN117318980A (en) 2023-07-10 2023-07-24 Small sample scene-oriented self-supervision learning malicious traffic detection method

Country Status (1)

Country Link
CN (1) CN117318980A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117614742A (en) * 2024-01-22 2024-02-27 广州大学 Malicious traffic detection method with enhanced honey point perception
CN117614742B (en) * 2024-01-22 2024-05-07 广州大学 Malicious traffic detection method with enhanced honey point perception

Similar Documents

Publication Publication Date Title
CN109284606B (en) Data flow anomaly detection system based on empirical characteristics and convolutional neural network
CN111340191B (en) Bot network malicious traffic classification method and system based on ensemble learning
CN113806746B (en) Malicious code detection method based on improved CNN (CNN) network
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN111008337B (en) Deep attention rumor identification method and device based on ternary characteristics
CN111144470A (en) Unknown network flow identification method and system based on deep self-encoder
CN115208680B (en) Dynamic network risk prediction method based on graph neural network
CN110533570A (en) General steganography method based on deep learning
CN112422531A (en) CNN and XGboost-based network traffic abnormal behavior detection method
CN113489751A (en) Network traffic filtering rule conversion method based on deep learning
CN110189167A (en) Mobile advertising fraud detection method based on heterogeneous graph embedding
He et al. Deep‐Feature‐Based Autoencoder Network for Few‐Shot Malicious Traffic Detection
CN113269228B (en) Method, device and system for training graph network classification model and electronic equipment
CN115277888B (en) Method and system for analyzing message type of mobile application encryption protocol
CN117318980A (en) Small sample scene-oriented self-supervision learning malicious traffic detection method
CN115277086B (en) Network background traffic generation method based on generative adversarial network
CN114172688A (en) Encrypted traffic network threat key node automatic extraction method based on GCN-DL
CN111224998B (en) Botnet identification method based on extreme learning machine
CN111130942B (en) Application flow identification method based on message size analysis
CN117082118A (en) Network connection method based on data derivation and port prediction
CN115361176A (en) SQL injection attack detection method based on FlexUDA model
CN109450876B (en) DDos identification method and system based on multi-dimensional state transition matrix characteristics
CN116684133A (en) SDN network abnormal flow classification device and method based on double-layer attention and space-time feature parallel fusion
CN116781341A (en) Decentralised network DDoS attack identification method based on large language model
CN116418565A (en) Domain name detection method based on attribute heterograph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination