CN114970680A - CNN + LSTM-based flow terminal real-time identification method and device - Google Patents

Info

Publication number
CN114970680A
Authority
CN
China
Prior art keywords
flow
lstm
model
cnn
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210459253.XA
Other languages
Chinese (zh)
Inventor
宁焕生
魏大为
万月亮
李莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN202210459253.XA
Publication of CN114970680A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a CNN + LSTM-based method and device for real-time identification of the terminal that generated network traffic. The method comprises the following steps: reassembling Transmission Control Protocol (TCP) sessions; extracting flow characteristics from each session and preprocessing the extracted characteristics; constructing a deep learning model that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) neural network; constructing a sample data set with the preprocessed flow characteristics as samples and the terminal information as labels; training the model on the sample data set by transfer learning to obtain a classifier; and classifying and marking traffic with the trained classifier. The method is based on CNN + LSTM and applies the idea of transfer learning. By learning the flow statistical characteristics and the User-Agent characteristics of the traffic, it classifies traffic by terminal in real time.

Description

CNN + LSTM-based flow terminal real-time identification method and device
Technical Field
The invention relates to the technical field of network data stream processing, in particular to a CNN + LSTM-based flow terminal real-time identification method and device.
Background
As cyberspace becomes ever more closely tied to production and daily life, it has become the fifth domain after land, sea, air, and space, and cyberspace governance has become a major requirement for national security and social stability. Network information is transmitted as messages exchanged between network devices, with different network protocols carrying different kinds of content. A basic requirement of cyberspace governance is therefore to distinguish different users, applications, and content, and to govern each in an appropriate way. One important task is to determine which terminal device generated a given traffic packet. Because application traffic from different terminals is mixed on the same line, a network administrator cannot easily separate the traffic of different users. Meanwhile, with the widespread use of encrypted network protocols, less and less terminal information can be obtained by content extraction, which makes fine-grained cyberspace governance difficult.
In the conventional approach, traffic generated by different terminals is distinguished by the IP address or MAC address carried in the traffic packets. However, with the heavy use of mobile devices and the growing number of devices that obfuscate their MAC address, the IP address and MAC address that correspond to one terminal may change continuously. Identifying the terminal to which traffic belongs is therefore becoming increasingly difficult with the conventional approach. Finding an effective method to identify the terminal to which network traffic belongs is a key problem that urgently needs to be solved in cyberspace governance.
Currently, research on traffic terminal classification often relies on machine learning algorithms. Such an algorithm learns from labeled traffic to obtain the features associated with each label and then classifies traffic at different levels. Machine learning classification based on flow statistical features does not examine the protocol payload; instead it characterizes protocols by the behavior of their data flows, and it performs well on known protocols. Because the flows produced by different processes have different statistics, analyzing flow statistics can distinguish different applications. Since these techniques avoid inspecting packet contents, they overcome the drawbacks of deep packet inspection and allow encrypted traffic to be classified statistically. However, because such methods must cache traffic packets, reassemble sessions, and compute flow statistics, they incur high complexity, cost, and load when processing high-speed traffic. In addition, a classifier based on flow statistics depends on the statistics chosen, so different flow statistics must be selected for different protocol combinations.
The prior art provides a terminal tracing method based on an XGBoost model. The method reassembles data-stream sessions, extracts User-Agent information, ID information, timestamp information, session information, and the like from each session, and computes session features from them. XGBoost then learns from sessions that contain terminal information, and the traffic is finally traced back to its terminal. The method thus finds terminal features in the basic information of a network session and traces traffic to its terminal by learning and recognizing those features with XGBoost.
Although this method achieves basic traffic tracing, it has the following defects. First, the XGBoost algorithm handles discrete-valued features poorly, and all features extracted from the User-Agent are discrete. Second, when classifying mixed traffic from multiple terminals, XGBoost converts the multi-class problem into several binary classification problems, so in practice the model is large and complex. More importantly, the method does not classify traffic in real time: the algorithm must cache data for a period of time and train on the labeled data in that set before it can classify the unlabeled data. It is therefore difficult to apply in practical high-traffic scenarios.
Disclosure of Invention
The invention provides a CNN + LSTM-based flow terminal real-time identification method and device, aiming to solve the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; converting a multi-class problem into several binary classification problems makes the model large and slow to compute; and real-time classification of traffic is difficult to achieve.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the invention provides a CNN + LSTM-based traffic terminal real-time identification method, which comprises the following steps:
reassembling a Transmission Control Protocol (TCP) session;
extracting flow characteristics from the session, and preprocessing the extracted flow characteristics;
constructing a deep learning model combining a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network;
constructing a sample data set by taking the preprocessed flow characteristics as samples and the terminal information as labels; training the model on the sample data set by transfer learning to obtain a classifier;
and carrying out flow classification and marking by using the trained classifier.
Further, the reassembling the transmission control protocol TCP session includes:
extracting a source IP address, a source port, a destination IP address, a destination port and a transport layer protocol number in the flow packet, and classifying TCP messages of different sessions;
and sequencing the messages according to the seq information in the messages and deleting the repeated data packets.
Further, extracting the flow characteristics from the session includes:
extracting IP quintuple, IP-Time to live, IP-ID, TCP-Window Size, TCP-ISN and TCP-MSS information from the SYN packet of a TCP session; extracting User-Agent information from a message containing an HTTP request; counting the number of bytes of the uplink and downlink messages, the number of uplink and downlink messages, and the session duration over the whole session;
processing the extracted User-Agent information, using the user_agent package in Python, into a device type, a device model, a browser type, a browser model, an operating system type, and an operating system model.
Further, the preprocessing the extracted flow characteristics includes:
performing data cleaning on the extracted flow characteristics, and eliminating data with the downlink flow of 0;
carrying out data standardization on the data subjected to data cleaning by adopting a preset data standardization algorithm;
encoding the four text-valued discrete characteristics, namely the browser type, the browser model, the operating system type and the operating system model, into 26-dimensional characteristics using a OneHotEncoder;
encoding the device type and the device model into 2-dimensional characteristics using a LabelEncoder.
Further, constructing a deep learning model of combination of CNN and LSTM, comprising:
using TensorFlow to build a sequential neural network, adopting a one-dimensional convolutional layer and a batch normalization layer to extract data characteristics, and using a one-dimensional maximum pooling layer to perform characteristic selection and information filtering;
adding two LSTM layers after the convolution layer to further learn the terminal characteristics;
and finally adding a full connection layer and an output layer.
Further, training the model by using the sample data set to obtain a classifier, including:
taking 100 terminals as a group, and collecting 200 pieces of label data by each terminal;
collecting 20 groups of sample data as pre-training data; pre-training the model by using the collected sample data; when each group of pre-training is performed, the output layer is initialized again; training each group of data for 200 rounds;
removing an output layer, storing a pre-training model, and finishing pre-training of the classifier;
collecting marked terminal sessions within a period of time, and retraining the classifier by taking the marked terminal sessions as a training set;
loading a pre-training model;
adding output layers with the number equal to that of the terminals;
training the model for 200 rounds;
and storing the classifier, and finishing the retraining of the classifier to obtain the trained classifier.
Further, the classifying and labeling the traffic by using the trained classifier includes:
loading the trained classifier;
classifying the acquired flow by using the trained classifier based on the preprocessed flow characteristics;
and accepting a classification result that exceeds the preset threshold value as flow of the corresponding terminal, and marking the original flow.
On the other hand, the invention also provides a CNN + LSTM-based traffic terminal real-time identification device, which comprises:
the session recombination module is used for recombining the Transmission Control Protocol (TCP) session;
the flow characteristic extraction and processing module is used for extracting flow characteristics from the session and preprocessing the extracted flow characteristics;
the deep learning model building module is used for building a deep learning model combining the convolutional neural network (CNN) and the long short-term memory (LSTM) neural network;
the model training module is used for constructing a sample data set by taking the preprocessed flow characteristics as samples and the terminal information as a label; training the model constructed by the deep learning model construction module by using the sample data set constructed by the flow characteristic extraction and processing module in a transfer learning mode to obtain a classifier;
and the flow classification and marking module is used for classifying and marking the flow by utilizing the trained classifier.
In yet another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the above-mentioned method.
The technical scheme provided by the invention has the beneficial effects that at least:
With encrypted protocol traffic now prevalent, the traditional DPI technology cannot classify traffic by terminal well. The invention is based on CNN + LSTM and applies the idea of transfer learning. By learning the flow statistical characteristics and the User-Agent characteristics of the traffic, it classifies traffic by terminal. It thereby solves the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; the XGBoost algorithm converts a multi-class problem into several binary classification problems, which makes the model large and slow to compute; and real-time classification of traffic is difficult to achieve.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic diagram of an execution flow of a CNN + LSTM-based traffic terminal real-time identification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a traffic terminal classification model based on CNN + LSTM according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
First embodiment
This embodiment provides a CNN + LSTM-based traffic terminal real-time identification method. It aims to solve two problems of the prior art: the errors caused by treating discrete features as continuous features, and the waste of resources caused by using several binary classification models to classify multi-terminal traffic. To address the difficulty existing schemes have with real-time traffic classification, the method pre-trains the model by transfer learning, which greatly reduces the number of labeled samples needed for retraining and improves the accuracy of the algorithm.
Based on the above, the execution flow of the method is shown in fig. 1, and includes the following steps:
1. TCP session reassembly: perform fast TCP session reassembly, duplicate-packet elimination, and related operations.
Specifically: extract the source IP address, source port, destination IP address, destination port, and transport layer protocol number from each traffic packet, and group the TCP messages by session. Then sort the messages by the seq (sequence number) information they carry and delete duplicate data packets.
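The reassembly step above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `Packet` record, the `session_key` and `reassemble` helpers, and the rule for recognizing a duplicate (same direction, sequence number, and payload length) are hypothetical and not taken from the original, and sequence-number wraparound is ignored.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int       # transport-layer protocol number (6 = TCP)
    seq: int         # TCP sequence number
    payload: bytes


def session_key(p: Packet):
    """Direction-independent five-tuple, so both directions map to one session."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def reassemble(packets):
    """Group packets by session, drop duplicates, and order them by seq."""
    sessions = defaultdict(list)
    for p in packets:
        sessions[session_key(p)].append(p)

    result = {}
    for key, pkts in sessions.items():
        seen, unique = set(), []
        for p in pkts:
            # Assumption: a duplicate repeats direction, seq and payload length.
            ident = (p.src_ip, p.src_port, p.seq, len(p.payload))
            if ident not in seen:
                seen.add(ident)
                unique.append(p)
        # Sequence numbers are per direction, so sort within each direction.
        unique.sort(key=lambda p: (p.src_ip, p.src_port, p.seq))
        result[key] = unique
    return result
```

Keying on a direction-independent five-tuple keeps the uplink and downlink packets of one connection in the same session, which the later per-session statistics rely on.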
2. Flow characteristic extraction: extract and compute the corresponding features from each session according to the feature selection requirements.
Specifically: as shown in Table 1, this embodiment extracts information such as the IP quintuple, IP-Time to live, IP-ID, TCP-Window Size, TCP-ISN, and TCP-MSS from the SYN packet of the TCP session. User-Agent information is extracted from a message containing an HTTP request. Features such as the number of bytes of the uplink and downlink messages, the number of uplink and downlink messages, and the session duration are computed over the whole session. Terminal-related information is extracted from part of the network stream content with tools such as deep packet inspection, and the session is labeled with its terminal.
TABLE 1 Extracted protocol features
Source protocol field             | Source message
----------------------------------+---------------------------------------------
HTTP-UA                           | First HTTP request message in the quintuple
IP quintuple                      | SYN message
IP-Time to live                   | SYN message
IP-ID                             | SYN message
TCP-Window Size                   | SYN message
TCP-ISN                           | SYN message
Byte number of uplink messages    | Session statistics based on the whole session
Byte number of downlink messages  | Session statistics based on the whole session
Duration of the session           | Session statistics based on the whole session
Number of uplink messages         | Session statistics based on the whole session
Number of downlink messages       | Session statistics based on the whole session
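As an illustration of the whole-session statistics listed in Table 1, the following minimal sketch computes the five session-level features from a list of packets. The `Pkt` record and the `session_statistics` function are hypothetical names introduced here, and it is assumed that each packet already carries a timestamp, a direction flag, and a length.

```python
from dataclasses import dataclass


@dataclass
class Pkt:
    timestamp: float    # capture time in seconds
    from_client: bool   # True = uplink (client to server), False = downlink
    length: int         # packet length in bytes


def session_statistics(pkts):
    """Return byte counts, packet counts and duration for one session."""
    up = [p for p in pkts if p.from_client]
    down = [p for p in pkts if not p.from_client]
    times = [p.timestamp for p in pkts]
    return {
        "up_bytes": sum(p.length for p in up),
        "down_bytes": sum(p.length for p in down),
        "up_packets": len(up),
        "down_packets": len(down),
        "duration": (max(times) - min(times)) if times else 0.0,
    }
```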
3. Flow characteristic preprocessing: clean, normalize, and re-encode the collected features according to their type. The specific process is as follows:
a) data cleansing
Data cleansing is the process of re-examining and verifying data in order to delete duplicate information, correct existing errors, and check data consistency. Given the characteristics of the internet traffic data collected in this embodiment, the output file generated by the stream analysis tool is parsed and the resulting streams are loaded into a DataFrame. Since the goal is to classify the streams generated by different terminals, this embodiment eliminates records whose downlink traffic is 0. Such records are typically half-open connections for which the other side never responded to the request, and they have no analytical value.
b) Data normalization
Data normalization mainly prevents large-magnitude features from swamping small-magnitude features when different features have different value ranges. Commonly used methods include Z-score normalization and max-min normalization. This embodiment adopts max-min normalization, which applies a linear transformation to the original data to map the values into the range [0, 1], using the following formula:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
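A minimal sketch of the max-min normalization described above, applied to one feature column. The function name `min_max_normalize` and the choice to map a constant column to 0.0 are assumptions made for this example, since the original does not say how a zero value range is handled.

```python
def min_max_normalize(values):
    """Linearly map a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Example: session durations in seconds.
print(min_max_normalize([10, 20, 30]))
```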
c) discrete feature processing
To further mine the terminal characteristics contained in the traffic, this embodiment extracts the User-Agent information from the TCP stream and uses the user_agent package in Python to process it into six values: device type, device model, browser type, browser model, operating system type, and operating system model. These text-valued discrete features are converted into numerical form with two common methods, OneHotEncoder and LabelEncoder. Specifically, the four discrete features browser type, browser model, operating system type, and operating system model are encoded with OneHotEncoder into 26-dimensional features, and the device type and device model are encoded with LabelEncoder into 2-dimensional features.
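To make the two encodings concrete, here is a small dependency-free sketch of label encoding and one-hot encoding. It is a hand-written stand-in for the OneHotEncoder and LabelEncoder named above, not their actual implementation; the helper names `fit_label_encoder`, `label_encode`, and `one_hot_encode` and the sample category values are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def label_encode(values, mapping):
    """Replace each category by its integer index: one output dimension."""
    return [mapping[v] for v in values]


def one_hot_encode(values, mapping):
    """Replace each category by a 0/1 vector with one dimension per category."""
    rows = []
    for v in values:
        row = [0] * len(mapping)
        row[mapping[v]] = 1
        rows.append(row)
    return rows


browsers = ["Chrome", "Safari", "Chrome", "Firefox"]
browser_map = fit_label_encoder(browsers)
print(one_hot_encode(browsers, browser_map))

devices = ["Mobile", "PC", "Mobile"]
print(label_encode(devices, fit_label_encoder(devices)))
```

One-hot encoding avoids imposing an artificial order on categories at the cost of one dimension per category, while label encoding keeps a single dimension per feature, which is consistent with the 26-dimensional and 2-dimensional outputs described above.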
4. Constructing a deep learning model combining CNN and LSTM as shown in FIG. 2; the method comprises the following specific steps:
a) a sequential neural network is built with TensorFlow; a one-dimensional convolutional layer and a batch normalization layer extract data features, and a one-dimensional max-pooling layer reduces the influence of small values on the model;
b) adding two LSTM layers after the convolution layer to further learn the terminal characteristics;
c) and finally adding a full connection layer and an output layer.
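Steps a) to c) can be sketched with the Keras API in TensorFlow as follows. The layer order follows the description, but every hyperparameter (32 filters, kernel size 3, pool size 2, 64 LSTM units, 64 dense units, the Adam optimizer, the loss function) and the function name `build_cnn_lstm` are assumptions for illustration, since the original does not specify them.

```python
def build_cnn_lstm(n_features, n_terminals):
    """Conv1D + BatchNorm + MaxPool, two LSTM layers, dense layer, softmax output."""
    # Imported lazily so that this sketch can be loaded without TensorFlow installed.
    import tensorflow as tf

    layers = tf.keras.layers
    model = tf.keras.Sequential([
        # Each sample is a feature vector treated as a length-n_features sequence.
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(filters=32, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(64, return_sequences=True),   # first LSTM passes a sequence on
        layers.LSTM(64),                          # second LSTM returns a single vector
        layers.Dense(64, activation="relu"),      # full connection layer
        layers.Dense(n_terminals, activation="softmax"),  # output layer
    ])
    # Integer (label-encoded) terminal labels pair with the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With TensorFlow installed, `build_cnn_lstm(n_features, n_terminals).summary()` prints the resulting layer stack; the input arrays then need the shape `(samples, n_features, 1)`.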
It should be noted that, in this embodiment, the classifier is built as a one-dimensional convolutional neural network followed by a recurrent neural network. A base classifier with good classification performance is pre-trained by transfer learning, so that good classification performance can be achieved with less data when specific terminals are classified.
The convolutional neural network is a feedforward neural network that involves convolution computations and has a deep structure. It is capable of representation learning and can effectively extract features from unordered data without additional feature selection steps.
The convolutional neural network mainly consists of convolutional layers, pooling layers, fully connected layers, and the like. The convolutional layer is the core of the network: it uses a convolution kernel to extract features from the input data. The formula of the one-dimensional convolution kernel is as follows:
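$$y_i = \sum_{j=0}^{k-1} w_j \, x_{i+j} + b$$
where $x$ is the input sequence, $w$ is a convolution kernel of length $k$, and $b$ is the bias.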
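As an illustration of the one-dimensional convolution described above, the following sketch slides a kernel over an input sequence. It assumes no padding and a stride of 1, and it follows the usual deep-learning convention of not flipping the kernel (cross-correlation); the function name `conv1d` is hypothetical.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid (unpadded), stride-1 one-dimensional convolution of x with kernel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


# A difference kernel responds to changes between neighbouring feature values.
print(conv1d([1, 2, 3, 4], [1, 0, -1]))
```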
After the convolutional layer has extracted features, the output feature map is passed to the pooling layer for feature selection and information filtering. This embodiment uses maximum pooling (max-pooling); the one-dimensional maximum pooling function is as follows:
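$$p_i = \max_{0 \le j < s} \; y_{i \cdot s + j}$$
where $y$ is the feature map output by the convolutional layer and $s$ is the pooling window size, with the stride equal to the window size.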
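A minimal sketch of one-dimensional max-pooling as described above. The default of a stride equal to the window size is an assumption for this example, and the name `max_pool1d` is hypothetical.

```python
def max_pool1d(x, pool_size=2, stride=None):
    """Keep the maximum of each window; incomplete trailing windows are dropped."""
    stride = stride or pool_size
    return [max(x[i:i + pool_size])
            for i in range(0, len(x) - pool_size + 1, stride)]


print(max_pool1d([1, 3, 2, 5, 4, 0]))
```

Because only the largest value in each window survives, small activations are filtered out, which matches the role the description assigns to this layer.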
To prevent the data distribution from differing between layers in a deep convolutional neural network, the tensor obtained after convolution is processed with Batch Normalization (BN). The BN layer can speed up training and convergence of the network, mitigate exploding and vanishing gradients, and help prevent overfitting. The algorithm of the BN layer is as follows:
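$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$
$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i = \gamma \hat{x}_i + \beta$$
where $m$ is the batch size, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learned scale and shift parameters.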
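The batch normalization step described above can be illustrated for a single feature over one mini-batch as follows. This sketch covers only the training-time forward pass in its standard form; the function name `batch_norm` and the default values of `gamma`, `beta`, and `eps` are assumptions, and the running statistics used at inference time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and unit variance, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


print(batch_norm([1.0, 2.0, 3.0]))
```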
At the output layer, SoftMax is used, in accordance with the label encoding applied to the terminal labels. The probability that each element $y_i$ is selected by SoftMax is:
$$P(y_i) = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$
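A short sketch of the SoftMax computation used at the output layer. Subtracting the maximum before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result. The function name `softmax` is chosen for this example.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(logits)                        # avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax([1.0, 2.0, 3.0]))
```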
5. Training a classifier: construct a sample data set with the preprocessed flow characteristics as samples and the terminal information as labels, and train the model on the sample data set by transfer learning. The specific steps are as follows:
a) Pre-training the classifier: collect labeled data and perform pre-training, specifically:
(1) Take 100 terminals as one group, and collect 200 pieces of labeled data per terminal.
(2) Collect 20 groups of data as pre-training data.
(3) Pre-train the model with the collected data. The output layer is re-initialized before each group is used for pre-training, and each group of data is trained for 200 rounds.
(4) Remove the output layer and save the pre-trained model.
b) Retraining the classifier: retrain with data from the current scenario to obtain a reliable classifier.
(1) Collect labeled terminal sessions over a period of time and use them as the training set for retraining the classifier.
(2) Load the pre-trained model.
(3) Add an output layer whose number of units equals the number of terminals.
(4) Train for 200 rounds.
(5) Save the classifier.
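The pre-training and retraining procedure above can be sketched with Keras as follows. This is a rough illustration rather than the patent's implementation: the function names `swap_output_layer` and `pretrain_and_retrain`, the optimizer, and the loss are assumptions, the model passed in is assumed to be a `tf.keras.Sequential` whose last layer is the softmax output, and saving and loading the model between the two phases is only indicated in comments.

```python
def swap_output_layer(model, n_classes):
    """Drop the last (output) layer and attach a freshly initialized softmax layer."""
    import tensorflow as tf  # lazy import: the sketch loads without TensorFlow

    new_head = tf.keras.layers.Dense(n_classes, activation="softmax")
    # Reuse every layer except the old output layer, so learned weights carry over.
    new_model = tf.keras.Sequential(model.layers[:-1] + [new_head])
    new_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    return new_model


def pretrain_and_retrain(model, groups, scene_x, scene_y, n_scene_terminals, epochs=200):
    """groups: iterable of (features, labels, n_classes), e.g. 20 groups of 100 terminals."""
    # a) Pre-training: the output layer is re-initialized for every group.
    for x, y, n_classes in groups:
        model = swap_output_layer(model, n_classes)
        model.fit(x, y, epochs=epochs, verbose=0)
    # (The pre-trained model, minus its output layer, would be saved here.)

    # b) Retraining: new output layer sized to the terminals in the current scenario.
    classifier = swap_output_layer(model, n_scene_terminals)
    classifier.fit(scene_x, scene_y, epochs=epochs, verbose=0)
    # (The retrained classifier would be saved here.)
    return classifier
```

Only the output layer depends on the number of terminals, so replacing it lets the convolutional and LSTM layers keep what they learned from the pre-training groups, which is why fewer labeled samples are needed in the retraining phase.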
6. Classifying and marking traffic with the trained classifier, which comprises the following steps:
a) Load the trained classifier.
b) Classify the collected flow with the trained classifier.
c) Accept a classification result that exceeds the preset threshold value as flow of the corresponding terminal.
d) Mark the original flow.
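Steps c) and d) can be illustrated with the following sketch, which accepts a prediction only when its probability exceeds a threshold. The function name `assign_terminal`, the default threshold of 0.9, and the terminal names are hypothetical; the original only states that a preset threshold is used.

```python
def assign_terminal(probabilities, terminals, threshold=0.9):
    """Return the most likely terminal, or None if the classifier is not confident."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return terminals[best]
    return None


terminals = ["terminal-A", "terminal-B", "terminal-C"]
print(assign_terminal([0.05, 0.92, 0.03], terminals))   # confident prediction
print(assign_terminal([0.40, 0.35, 0.25], terminals))   # below threshold, left unmarked
```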
To sum up, this embodiment is based on CNN + LSTM and applies the idea of transfer learning. By learning the flow statistical characteristics and the User-Agent characteristics of the traffic, it classifies traffic by terminal. It thereby solves the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; the XGBoost algorithm converts a multi-class problem into several binary classification problems, which makes the model large and slow to compute; and real-time classification of traffic is difficult to achieve.
Second embodiment
The embodiment provides a CNN + LSTM-based traffic terminal real-time identification device, which includes the following modules:
the session recombination module is used for recombining the Transmission Control Protocol (TCP) session;
the flow characteristic extraction and processing module is used for extracting flow characteristics from the session and preprocessing the extracted flow characteristics;
the deep learning model building module is used for building a deep learning model combining the convolutional neural network (CNN) and the long short-term memory (LSTM) neural network;
the model training module is used for constructing a sample data set by taking the preprocessed flow characteristics as samples and the terminal information as a label; training the model constructed by the deep learning model construction module by using the sample data set constructed by the flow characteristic extraction and processing module in a transfer learning mode to obtain a classifier;
and the flow classification and marking module is used for classifying and marking the flow by utilizing the trained classifier.
The CNN + LSTM-based traffic terminal real-time identification apparatus of this embodiment corresponds to the CNN + LSTM-based traffic terminal real-time identification method of the first embodiment; the functions realized by each functional module in the CNN + LSTM-based flow terminal real-time identification device correspond to each flow step in the CNN + LSTM-based flow terminal real-time identification method one by one; therefore, it will not be described herein.
Third embodiment
The present embodiment provides an electronic device, which includes a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) and one or more memories, where at least one instruction is stored in the memory, and the instruction is loaded by the processor and executes the method.
Fourth embodiment
The present embodiment provides a computer-readable storage medium, in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the method of the first embodiment. The computer readable storage medium may be, among others, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like. The instructions stored therein may be loaded by a processor in the terminal and perform the above-described method.
Furthermore, it should be noted that the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (8)

1. A CNN + LSTM-based traffic terminal real-time identification method, characterized by comprising the following steps:
reassembling Transmission Control Protocol (TCP) sessions;
extracting traffic features from the sessions, and preprocessing the extracted traffic features;
constructing a deep learning model combining a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network;
constructing a sample data set by taking the preprocessed traffic features as samples and the terminal information as labels; training the model with the sample data set by means of transfer learning to obtain a classifier;
and classifying and marking traffic by using the trained classifier.
2. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein reassembling the TCP sessions comprises:
extracting the source IP address, source port, destination IP address, destination port and transport-layer protocol number from each traffic packet, and grouping the TCP messages by session;
and ordering the messages of each session according to the sequence number (seq) carried in the messages, and deleting duplicate data packets.
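The two steps of claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Packet` record and the `reassemble` function are hypothetical names, each direction of a connection is treated as its own flow, and sequence-number wraparound is ignored.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical, already-parsed packet record (not from the patent)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: int      # transport-layer protocol number (6 = TCP)
    seq: int        # TCP sequence number
    payload: bytes


FlowKey = Tuple[str, int, str, int, int]


def reassemble(packets: List[Packet]) -> Dict[FlowKey, List[Packet]]:
    """Group TCP packets by five-tuple, order them by seq, drop duplicates."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for p in packets:
        if p.proto != 6:  # keep TCP only
            continue
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)
        flows[key].append(p)

    result: Dict[FlowKey, List[Packet]] = {}
    for key, pkts in flows.items():
        seen = set()
        ordered = []
        # Assumption: no sequence-number wraparound within a flow.
        for p in sorted(pkts, key=lambda q: q.seq):
            ident = (p.seq, p.payload)
            if ident in seen:  # retransmitted duplicate
                continue
            seen.add(ident)
            ordered.append(p)
        result[key] = ordered
    return result
```

A production reassembler would also pair the two directions of a connection into one session and handle overlapping segments; both are left out here for brevity.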
3. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein extracting traffic features from the sessions comprises:
extracting the IP five-tuple, IP Time to Live (TTL), IP-ID, TCP Window Size, TCP initial sequence number (TCP-ISN) and TCP maximum segment size (TCP-MSS) from the SYN packet of a TCP session; extracting User-Agent information from messages containing an HTTP request; counting, over the whole session, the numbers of bytes of the uplink and downlink messages, the numbers of uplink and downlink messages, and the session duration;
and processing the extracted User-Agent information into a device type, a device model, a browser type, a browser model, an operating system type and an operating system model by using the user_agent package in Python.
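The session-level statistics named in claim 3 (uplink and downlink byte counts, message counts, and session duration) can be sketched as follows. The `Msg` record, the `session_stats` function, and the dictionary keys are hypothetical names chosen for illustration. Header-field extraction and User-Agent parsing are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Msg:
    """Hypothetical per-message record for one reassembled session."""
    ts: float       # capture timestamp in seconds
    uplink: bool    # True if sent by the terminal (client to server)
    length: int     # message size in bytes


def session_stats(msgs: List[Msg]) -> Dict[str, float]:
    """Compute whole-session counters from the messages of one session."""
    up = [m for m in msgs if m.uplink]
    down = [m for m in msgs if not m.uplink]
    times = [m.ts for m in msgs]
    return {
        "up_bytes": sum(m.length for m in up),
        "down_bytes": sum(m.length for m in down),
        "up_msgs": len(up),
        "down_msgs": len(down),
        # Duration is last timestamp minus first; 0.0 for an empty session.
        "duration": (max(times) - min(times)) if times else 0.0,
    }
```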
4. The CNN + LSTM-based traffic terminal real-time identification method according to claim 3, wherein preprocessing the extracted traffic features comprises:
performing data cleaning on the extracted traffic features to eliminate data whose downlink traffic is 0;
standardizing the cleaned data with a preset data standardization algorithm;
processing the four discrete text features, namely the browser type, browser model, operating system type and operating system model, into 26-dimensional features by using a OneHotEncoder;
and processing the device type and the device model into 2-dimensional features by using a LabelEncoder.
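The preprocessing steps of claim 4 can be illustrated without any library dependency. The sketch below is an assumption-laden stand-in: z-score standardization is used as one possible "preset data standardization algorithm" (the claim does not name one), and `one_hot` and `label_encode` are plain-Python illustrations of the two encodings, not the OneHotEncoder and LabelEncoder objects named in the claim. All function names and the `down_bytes` key are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def drop_zero_downlink(rows: List[Dict]) -> List[Dict]:
    """Data cleaning: discard sessions whose downlink traffic is 0."""
    return [r for r in rows if r["down_bytes"] != 0]


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score standardization (assumed choice of algorithm)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """One column per distinct category, in sorted order."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    return [[1 if index[v] == i else 0 for i in range(len(cats))]
            for v in values]


def label_encode(values: Sequence[str]) -> List[int]:
    """Map each distinct category to an integer, in sorted order."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    return [index[v] for v in values]
```

One-hot encoding yields one dimension per category, while label encoding yields a single integer column per feature, which is consistent with the claim assigning 26 dimensions to the four one-hot features and 2 dimensions to the two label-encoded ones.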
5. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein constructing the deep learning model combining CNN and LSTM comprises:
building a sequential neural network with TensorFlow, using a one-dimensional convolutional layer and a batch normalization layer to extract data features, and using a one-dimensional max pooling layer for feature selection and information filtering;
adding two LSTM layers after the convolutional layer to further learn the terminal features;
and finally adding a fully connected layer and an output layer.
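The layer order described in claim 5 could look roughly like the following TensorFlow Keras sketch. Only the layer types and their order come from the claim. The function name `build_cnn_lstm`, the filter count, kernel size, pool size, LSTM and dense widths, activations, and the input shape of `(n_features, 1)` are all assumptions made for illustration.

```python
import tensorflow as tf


def build_cnn_lstm(n_features: int, n_terminals: int) -> tf.keras.Model:
    """Sequential CNN + LSTM classifier; all sizes are illustrative guesses."""
    return tf.keras.Sequential([
        # Each sample is treated as a length-n_features sequence with 1 channel.
        tf.keras.Input(shape=(n_features, 1)),
        # One-dimensional convolution and batch normalization extract features.
        tf.keras.layers.Conv1D(32, kernel_size=3, padding="same",
                               activation="relu"),
        tf.keras.layers.BatchNormalization(),
        # One-dimensional max pooling selects features and filters information.
        tf.keras.layers.MaxPooling1D(pool_size=2),
        # Two LSTM layers; the first must return the full sequence.
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.LSTM(64),
        # Fully connected layer followed by the output layer.
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_terminals, activation="softmax"),
    ])
```

The first LSTM layer sets `return_sequences=True` so that the second LSTM layer receives a sequence rather than a single vector.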
6. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein training the model with the sample data set to obtain a classifier comprises:
taking 100 terminals as a group, and collecting 200 labeled data items for each terminal;
collecting 20 groups of sample data as pre-training data; pre-training the model with the collected sample data; re-initializing the output layer before pre-training on each group; training on each group of data for 200 rounds;
removing the output layer and saving the pre-trained model, which completes the pre-training of the classifier;
collecting labeled terminal sessions over a period of time, and retraining the classifier with the labeled terminal sessions as the training set;
loading the pre-trained model;
adding an output layer whose number of units equals the number of terminals;
training the model for 200 rounds;
and saving the classifier, which completes the retraining and yields the trained classifier.
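The transfer-learning step of claim 6, in which the pre-trained layers are kept and a new output layer sized to the number of terminals is attached, can be sketched with TensorFlow Keras as below. The helper `swap_output_layer`, the layer names, the optimizer, and the loss are assumptions. The tiny dense network merely stands in for the pre-trained CNN + LSTM model, and saving, loading, and the 200 training rounds are omitted.

```python
import tensorflow as tf


def swap_output_layer(pretrained: tf.keras.Model,
                      n_terminals: int) -> tf.keras.Model:
    """Reuse every layer except the old output layer and add a new head."""
    # Output of the layer just before the old output layer.
    features = pretrained.layers[-2].output
    new_out = tf.keras.layers.Dense(n_terminals, activation="softmax",
                                    name="terminal_output")(features)
    model = tf.keras.Model(inputs=pretrained.inputs, outputs=new_out)
    # Optimizer and loss are illustrative choices, not taken from the claim.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Tiny stand-in for a model pre-trained on one group of 100 terminals.
pretrained = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu", name="shared"),
    tf.keras.layers.Dense(100, activation="softmax", name="pretrain_output"),
])

# Retraining target: 5 terminals observed in the deployment network.
classifier = swap_output_layer(pretrained, n_terminals=5)
```

Because the shared layers are the same objects in both models, the weights learned during pre-training carry over, and only the new output layer starts from a fresh initialization.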
7. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein classifying and marking the traffic by using the trained classifier comprises:
loading the trained classifier;
classifying the acquired traffic with the trained classifier based on the preprocessed traffic features;
and accepting a classification result greater than a preset threshold as traffic of the corresponding terminal, and marking the original traffic accordingly.
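The thresholded acceptance step of claim 7 can be illustrated in a few lines of Python. The function name `label_session` and the default threshold of 0.9 are assumptions. The claim only states that a preset threshold is used.

```python
from typing import Optional, Sequence


def label_session(probs: Sequence[float],
                  terminals: Sequence[str],
                  threshold: float = 0.9) -> Optional[str]:
    """Return the terminal label if the top score exceeds the threshold.

    `probs` is the classifier output for one session, one score per
    terminal, in the same order as `terminals`. Sessions whose best score
    does not exceed the threshold are left unmarked (None).
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return terminals[best] if probs[best] > threshold else None
```

Leaving low-confidence sessions unmarked trades coverage for precision: traffic from terminals the classifier has not seen is less likely to be attributed to a known terminal.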
8. A CNN + LSTM-based traffic terminal real-time identification device, characterized by comprising:
a session reassembly module, configured to reassemble Transmission Control Protocol (TCP) sessions;
a traffic feature extraction and processing module, configured to extract traffic features from the sessions and preprocess the extracted traffic features;
a deep learning model construction module, configured to construct a deep learning model combining a convolutional neural network (CNN) and a long short-term memory (LSTM) neural network;
a model training module, configured to construct a sample data set by taking the preprocessed traffic features as samples and the terminal information as labels, and to train the model constructed by the deep learning model construction module with the sample data set by means of transfer learning to obtain a classifier;
and a traffic classification and marking module, configured to classify and mark traffic by using the trained classifier.
CN202210459253.XA 2022-04-26 2022-04-26 CNN + LSTM-based flow terminal real-time identification method and device Pending CN114970680A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210459253.XA CN114970680A (en) 2022-04-26 2022-04-26 CNN + LSTM-based flow terminal real-time identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210459253.XA CN114970680A (en) 2022-04-26 2022-04-26 CNN + LSTM-based flow terminal real-time identification method and device

Publications (1)

Publication Number Publication Date
CN114970680A true CN114970680A (en) 2022-08-30

Family

ID=82979304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210459253.XA Pending CN114970680A (en) 2022-04-26 2022-04-26 CNN + LSTM-based flow terminal real-time identification method and device

Country Status (1)

Country Link
CN (1) CN114970680A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134176A (en) * 2022-09-02 2022-09-30 南京航空航天大学 Hidden network encrypted traffic classification method based on incomplete supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination