CN114970680A

CN114970680A - CNN + LSTM-based flow terminal real-time identification method and device

Info

Publication number: CN114970680A
Application number: CN202210459253.XA
Authority: CN
Inventors: 宁焕生; 魏大为; 万月亮; 李莎
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2022-08-30

Abstract

The invention discloses a CNN + LSTM-based traffic terminal real-time identification method and device. The method comprises the following steps: reassembling Transmission Control Protocol (TCP) sessions; extracting traffic features from each session and preprocessing the extracted features; constructing a deep learning model that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network; constructing a sample data set with the preprocessed traffic features as samples and the terminal information as labels; training the model on the sample data set by transfer learning to obtain a classifier; and classifying and marking traffic with the trained classifier. The method is based on CNN + LSTM and uses the idea of transfer learning. By learning the traffic statistical features and the traffic User-Agent features, it achieves real-time classification of traffic by terminal.

Description

CNN + LSTM-based flow terminal real-time identification method and device

Technical Field

The invention relates to the technical field of network data stream processing, in particular to a CNN + LSTM-based flow terminal real-time identification method and device.

Background

As cyberspace becomes ever more closely tied to human production and daily life, it has become the fifth domain after land, sea, air and space, and cyberspace governance has become a major requirement for national security and social stability. Network information is transmitted between network devices as network messages, and different network protocols carry different network content. The basic requirement of cyberspace governance is therefore to distinguish different users, applications and content, and to govern each in an appropriate way. One key task is to determine which terminal device generated a given traffic packet. Because application traffic from different terminals is mixed on the same line, a network administrator cannot easily separate the traffic of different users. Meanwhile, with the widespread use of encrypted network protocols, less and less terminal information can be obtained by content extraction, which makes fine-grained cyberspace governance difficult.

Conventional management methods distinguish the traffic generated by different terminals by identifying the IP address or MAC address information contained in traffic packets. However, with the heavy use of mobile devices and the growing number of devices that obfuscate their MAC addresses, the IP address and MAC address that correspond to the same terminal may change continually, so identifying the terminal to which traffic belongs is becoming increasingly difficult with the conventional approach. How to effectively identify the terminal to which network traffic belongs is a key problem that urgently needs to be solved in cyberspace governance.

Currently, research on traffic terminal classification often uses machine learning algorithms. A machine learning algorithm learns from labeled traffic to obtain the features that correspond to each label, and then classifies traffic at different levels. Machine learning classification based on flow statistical features does not examine the protocol payload; instead it describes protocols in terms of data flow behavior, and it performs well when classifying known protocols. Since the flows produced by each process have different flow statistics, analyzing those statistics can distinguish different applications. Because these techniques avoid inspecting packet contents, they overcome the drawbacks of techniques based on deep packet inspection and allow statistical classification of encrypted traffic. However, such methods must cache traffic packets, reassemble sessions and compute flow statistics, so they incur high complexity, cost and load when processing high-speed traffic. In addition, since a classifier based on flow statistics is tied to particular statistics, different flow statistics must be selected for different protocol combinations.

The prior art provides a terminal tracing method based on an XGBoost model. The method reassembles data-stream sessions, extracts the User-Agent information, ID information, timestamp information, session information and so on from each session, and computes the session features. XGBoost is then trained on sessions that contain terminal information, and the traffic is finally traced to its terminal. The method thus finds terminal features in the basic information of network sessions, and traces traffic to terminals by learning and recognizing those features with XGBoost.

Although this method achieves basic traffic tracing, it has the following defects. First, the XGBoost algorithm handles discrete-valued features poorly, and all features extracted from the User-Agent are discrete. Second, when XGBoost handles the classification of mixed traffic from multiple terminals, it converts the multi-class problem into several binary classification problems, so the model becomes large and complex in practice. More importantly, the method does not solve real-time traffic classification: the algorithm must cache data for a period of time and train on the labeled data in the data set before it can classify the unlabeled data. It is therefore difficult to apply in practical high-traffic scenarios.

Disclosure of Invention

The invention provides a CNN + LSTM-based traffic terminal real-time identification method and device, which aim to solve the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; converting a multi-class problem into several binary classification problems makes the model large and slow to compute; and real-time traffic classification is difficult to achieve.

In order to solve the technical problems, the invention provides the following technical scheme:

on one hand, the invention provides a CNN + LSTM-based traffic terminal real-time identification method, which comprises the following steps:

recombining a Transmission Control Protocol (TCP) session;

extracting traffic features from the session, and preprocessing the extracted traffic features;

constructing a deep learning model combining a convolutional neural network (CNN) and a long short-term memory (LSTM) network;

constructing a sample data set by taking the preprocessed flow characteristics as samples and the terminal information as a label; training the model by using the sample data set in a transfer learning mode to obtain a classifier;

and carrying out flow classification and marking by using the trained classifier.

Further, the reassembling the transmission control protocol TCP session includes:

extracting a source IP address, a source port, a destination IP address, a destination port and a transport layer protocol number in the flow packet, and classifying TCP messages of different sessions;

and sequencing the messages according to the seq information in the messages and deleting the repeated data packets.

Further, the extracting the traffic characteristics from the session includes:

extracting IP quintuple, IP-Time to live, IP-ID, TCP-Window Size, TCP-ISN and TCP-MSS information from the SYN packet of a TCP session; extracting User-Agent information from a message containing an HTTP request; counting the number of bytes of the uplink and downlink messages, the number of uplink and downlink messages, and the session duration based on the whole session;

processing the extracted User-Agent information into a device type, a device model, a browser type, a browser model, an operating system type and an operating system model using the user_agent package in Python.

Further, the preprocessing the extracted flow characteristics includes:

performing data cleaning on the extracted traffic features, and removing records whose downlink traffic is 0;

standardizing the cleaned data with a preset data standardization algorithm;

processing the four text-valued discrete features, namely the browser type, the browser model, the operating system type and the operating system model, into 26-dimensional features using a OneHotEncoder;

processing the device type and the device model into 2-dimensional features using a LabelEncoder.

Further, constructing a deep learning model of combination of CNN and LSTM, comprising:

building a sequential neural network with TensorFlow, using a one-dimensional convolutional layer and a batch normalization layer to extract data features, and using a one-dimensional max-pooling layer for feature selection and information filtering;

adding two LSTM layers after the convolutional layer to further learn the terminal features;

and finally adding a fully connected layer and an output layer.

Further, training the model by using the sample data set to obtain a classifier, including:

taking 100 terminals as one group, and collecting 200 labeled records per terminal;

collecting 20 groups of sample data as pre-training data; pre-training the model with the collected sample data; reinitializing the output layer before the pre-training of each group; training each group of data for 200 rounds;

removing the output layer and saving the pre-trained model, which completes the pre-training of the classifier;

collecting the labeled terminal sessions within a period of time, and using them as the training set for retraining the classifier;

loading the pre-trained model;

adding an output layer whose size equals the number of terminals;

training the model for 200 rounds;

and saving the classifier, which completes the retraining and yields the trained classifier.

Further, the classifying and labeling the traffic by using the trained classifier includes:

loading the trained classifier;

classifying the collected traffic with the trained classifier based on the preprocessed traffic features;

and accepting a classification result above the preset threshold as traffic of the corresponding terminal, and marking the original traffic.

On the other hand, the invention also provides a CNN + LSTM-based traffic terminal real-time identification device, which comprises:

the session recombination module is used for recombining the Transmission Control Protocol (TCP) session;

the traffic feature extraction and processing module is used for extracting traffic features from the session and preprocessing the extracted traffic features;

the deep learning model building module is used for building a deep learning model combining a convolutional neural network (CNN) and a long short-term memory (LSTM) network;

the model training module is used for constructing a sample data set with the preprocessed traffic features as samples and the terminal information as labels, and for training the model built by the deep learning model building module on that sample data set by transfer learning to obtain a classifier;

and the flow classification and marking module is used for classifying and marking the flow by utilizing the trained classifier.

In yet another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the above-described method.

In yet another aspect, the present invention also provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the above-mentioned method.

The technical scheme provided by the invention has the beneficial effects that at least:

With encrypted protocol traffic now prevalent, traditional DPI technology cannot classify traffic by terminal well. The invention is based on CNN + LSTM and uses the idea of transfer learning. By learning the traffic statistical features and the traffic User-Agent features, it classifies traffic by terminal. It thereby solves the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; the XGBoost algorithm converts a multi-class problem into several binary classification problems, which makes the model large and slow to compute; and real-time traffic classification is difficult to achieve.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram of an execution flow of a CNN + LSTM-based traffic terminal real-time identification method according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a traffic terminal classification model based on CNN + LSTM according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

First embodiment

This embodiment provides a CNN + LSTM-based traffic terminal real-time identification method. It aims to solve the errors caused in the prior art by treating discrete features as continuous features, and the waste of resources caused by using several binary classification models for multi-terminal traffic classification. To address the difficulty existing schemes have with real-time traffic classification, the method pre-trains the model by transfer learning, which greatly reduces the number of labeled samples needed for retraining and improves the accuracy of the algorithm.

Based on the above, the execution flow of the method is shown in fig. 1, and includes the following steps:

1. TCP session reassembly: perform fast TCP session reassembly, duplicate-packet removal and related operations.

Specifically: extract the source IP address, source port, destination IP address, destination port and transport layer protocol number from each traffic packet, and group the TCP messages into their sessions. Then sort the messages by the seq information they carry and delete duplicate packets.
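The reassembly step can be sketched as follows. This is a minimal illustration rather than the patented implementation: packets are assumed to be already parsed into dictionaries, the key names (`src_ip`, `seq`, `payload` and so on) and the function names are hypothetical, and sequence-number wraparound is ignored.

```python
from collections import defaultdict


def session_key(pkt):
    """Direction-independent five-tuple key for a packet."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    # Sort the endpoints so both directions map to the same session.
    return (min(a, b), max(a, b), pkt["proto"])


def reassemble_sessions(packets):
    """Group packets into sessions, drop duplicates, order by seq per sender."""
    sessions = defaultdict(dict)
    for pkt in packets:
        sender = (pkt["src_ip"], pkt["src_port"])
        # Assumption: a duplicate repeats the sender, seq and payload length.
        dedup_id = (sender, pkt["seq"], len(pkt["payload"]))
        sessions[session_key(pkt)].setdefault(dedup_id, pkt)
    # Sorting the dedup ids orders packets by sender first, then by seq.
    return {
        key: [pkts[i] for i in sorted(pkts)]
        for key, pkts in sessions.items()
    }
```

Ordering is done per sender because TCP sequence numbers are independent in each direction of a session.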

2. Traffic feature extraction: extract and compute the corresponding features from each session according to the feature selection requirements.

Specifically: as shown in Table 1, this embodiment extracts the IP quintuple, IP-Time to live, IP-ID, TCP-Window Size, TCP-ISN, TCP-MSS and similar information from the SYN packet of the TCP session. User-Agent information is extracted from the message containing an HTTP request. The number of bytes of the uplink and downlink messages, the number of uplink and downlink messages, the session duration and other features are computed over the whole session. Terminal-related information is extracted from part of the network stream content with tools such as deep packet inspection, and the session is labeled with its terminal.

TABLE 1 Extracted protocol features

Protocol field	Source message
HTTP-UA	First HTTP request message in the quintuple
IP quintuple	SYN message
IP-Time to live	SYN message
IP-ID	SYN message
TCP-Window Size	SYN message
TCP-ISN	SYN message
Byte number of uplink messages	Session statistics based on the whole session
Byte number of downlink messages	Session statistics based on the whole session
Session duration	Session statistics based on the whole session
Number of uplink messages	Session statistics based on the whole session
Number of downlink messages	Session statistics based on the whole session
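The whole-session statistics in Table 1 can be computed in a single pass over a reassembled session. The sketch below assumes each packet is reduced to a `(timestamp, src_ip, length)` tuple and that "uplink" means packets sent by the client that opened the session; the function name and the output keys are hypothetical.

```python
def session_statistics(packets, client_ip):
    """Compute whole-session statistics from (timestamp, src_ip, length) tuples."""
    stats = {"up_bytes": 0, "down_bytes": 0, "up_packets": 0, "down_packets": 0}
    for _, src_ip, length in packets:
        # Assumption: packets sent by the client are uplink, the rest downlink.
        direction = "up" if src_ip == client_ip else "down"
        stats[direction + "_bytes"] += length
        stats[direction + "_packets"] += 1
    times = [ts for ts, _, _ in packets]
    # Duration is the span between the first and the last packet seen.
    stats["duration"] = max(times) - min(times) if times else 0.0
    return stats
```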

3. Traffic feature preprocessing: the collected features are cleaned, standardized, re-encoded and so on according to their types. The specific process is as follows:

a) Data cleaning

Data cleaning is the process of re-examining and verifying data in order to delete duplicate information, correct existing errors and check data consistency. Given the characteristics of the Internet traffic data collected in this embodiment, the output file generated by the stream analysis tool is parsed and the resulting streams are loaded into a DataFrame. Since the goal is to classify the streams generated by different terminals, this embodiment removes records whose downlink traffic is 0. Such records are typically half-open connections that received no response from the other side, and they have no analytical value.

b) Data normalization

Data normalization mainly prevents features with a large range from swamping features with a small range, since different features have different dynamic ranges. Commonly used methods include Z-score normalization and min-max normalization. This embodiment uses min-max normalization, which applies a linear transformation to the original data and maps the values into [0,1]. The formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ is the original value of a feature, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature in the data set.
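The cleaning rule in a) and the min-max scaling in b) can be sketched together in plain Python. The record layout (a list of dictionaries with a `down_bytes` key) and the function name are assumptions made for illustration; the source loads the data into a DataFrame instead.

```python
def clean_and_normalize(rows, numeric_keys):
    """Drop sessions without downlink traffic, then min-max scale each feature."""
    # Copy the kept rows so the caller's records are not modified.
    kept = [dict(r) for r in rows if r["down_bytes"] > 0]
    for key in numeric_keys:
        values = [r[key] for r in kept]
        if not values:
            continue
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in kept:
            # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
            r[key] = (r[key] - lo) / span if span else 0.0
    return kept
```

Each feature is scaled with its own minimum and maximum, so byte counts and durations end up on the same [0,1] scale.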

c) Discrete feature processing

To further mine the terminal features contained in the traffic, this embodiment extracts the User-Agent information in the TCP stream and uses the user_agent package in Python to process it into six values: device type, device model, browser type, browser model, operating system type and operating system model. These text-valued discrete features are converted into numerical form with two common methods, OneHotEncoder and LabelEncoder. Specifically, the four text-valued discrete features browser type, browser model, operating system type and operating system model are processed by the OneHotEncoder into 26-dimensional features, and the device type and device model are processed by the LabelEncoder into 2-dimensional features.
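The difference between the two encodings can be illustrated without any dependency. The functions below are hypothetical stand-ins that show what one-hot encoding and label encoding produce for a single column; they are not the OneHotEncoder and LabelEncoder classes named in the text.

```python
def one_hot_encode(values):
    """Return (categories, vectors): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def label_encode(values):
    """Return (categories, codes): one integer code per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return categories, [index[v] for v in values]
```

One-hot encoding widens a column to as many dimensions as it has categories, which is how four text features can add up to 26 dimensions, while label encoding keeps one integer per column, giving 2 dimensions for device type and device model.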

4. Construct a deep learning model combining CNN and LSTM, as shown in Fig. 2. The specific steps are as follows:

a) build a sequential neural network with TensorFlow, extract data features with a one-dimensional convolutional layer and a batch normalization layer, and use a one-dimensional max-pooling layer to reduce the influence of small values on the model;

b) add two LSTM layers after the convolutional layer to further learn the terminal features;

c) finally add a fully connected layer and an output layer.
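Steps a) to c) could look like the following Keras sketch, assuming TensorFlow 2.x. The source specifies only the layer types and their order; the filter count, kernel size, LSTM width, dense width, optimizer, loss and the `build_model` name are all illustrative assumptions.

```python
from tensorflow.keras import layers, models


def build_model(num_features, num_terminals):
    """CNN + LSTM classifier over a feature vector treated as a 1-D sequence."""
    model = models.Sequential([
        layers.Input(shape=(num_features, 1)),
        # a) 1-D convolution, batch normalization and 1-D max pooling.
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        # b) Two stacked LSTM layers; the first must return the full sequence.
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64),
        # c) Fully connected layer and softmax output, one unit per terminal.
        layers.Dense(64, activation="relu"),
        layers.Dense(num_terminals, activation="softmax"),
    ])
    # Assumption: integer terminal labels, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Each preprocessed feature vector is reshaped to `(num_features, 1)` so that the convolution slides along the feature axis.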

Note that this embodiment builds the classifier as a one-dimensional convolutional neural network followed by a recurrent neural network. A base classifier with good classification performance is pre-trained by transfer learning, so that good results can be achieved with less data when classifying specific terminals.

The convolutional neural network is a feedforward neural network that includes convolution computations and has a deep structure. It is capable of representation learning and can effectively extract features from unordered data without additional feature selection steps.

The convolutional neural network mainly consists of convolutional layers, pooling layers, fully connected layers and so on. The convolutional layer is the core of the network: its convolution kernel performs feature extraction on the input data. The formula of the one-dimensional convolution kernel is as follows:
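A standard form of the one-dimensional convolution, with notation assumed here rather than taken from the source, is:

$$c_j = f\left(\sum_{i=0}^{k-1} w_i \, x_{j+i} + b\right)$$

where $x$ is the input sequence, $w$ is a convolution kernel of length $k$, $b$ is the bias, $f$ is the activation function and $c_j$ is the $j$-th element of the output feature map.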

After the convolutional layer performs feature extraction, the output feature map is passed to the pooling layer for feature selection and information filtering. Max pooling is used here, and the one-dimensional max-pooling function is as follows:
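A standard form of one-dimensional max pooling, assuming non-overlapping windows of size $s$ (the symbols are assumptions, not taken from the source), is:

$$p_j = \max_{0 \le i < s} c_{j \cdot s + i}$$

where $c$ is the feature map produced by the convolutional layer and $p_j$ is the $j$-th pooled output.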

To prevent the data distribution from becoming inconsistent across the layers of a deep convolutional neural network, the tensor obtained after convolution is processed with batch normalization (BN). The BN layer can speed up training and convergence, mitigate exploding and vanishing gradients, and help prevent overfitting. The algorithm of the BN layer is as follows:
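The usual batch normalization computation over a mini-batch of $m$ values $x_1, \dots, x_m$, given here in commonly used notation that is assumed rather than taken from the source, is:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_i = \gamma \, \hat{x}_i + \beta$$

where $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the learned scale and shift parameters.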

Because the terminal labels are label-encoded, SoftMax is used as the output layer. The probability that each element yi is selected in SoftMax is:
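In its standard form, with $n$ assumed to be the number of output units (one per terminal), the SoftMax probability is:

$$P(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}}$$

The probabilities are positive and sum to 1 over the $n$ terminals.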

5. Training the classifier: construct a sample data set with the preprocessed traffic features as samples and the terminal information as labels, and train the model on the sample data set by transfer learning. The specific steps are as follows:

a) Pre-training the classifier: collect labeled data and perform pre-training, as follows:

(1) Take 100 terminals as one group and collect 200 labeled records per terminal.

(2) Collect 20 groups of data as pre-training data.

(3) Pre-train with the collected data. The output layer is reinitialized before the pre-training of each group, and each group of data is trained for 200 rounds.

(4) Remove the output layer and save the pre-trained model.

b) Retraining the classifier: retrain with data from the current scenario to obtain a reliable classifier.

(1) Collect the labeled terminal sessions within a period of time and use them as the training set for retraining the classifier.

(2) Load the pre-trained model.

(3) Add an output layer whose size equals the number of terminals.

(4) Train for 200 rounds.

(5) Save the classifier.
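The pre-training and retraining stages can be sketched in Keras as below, assuming TensorFlow 2.x. The function names, layer sizes, optimizer and loss are assumptions; only the 200 rounds, the per-group reinitialization of the output layer and the replacement of the output layer come from the text. Saving and reloading the base model between the two stages is omitted for brevity.

```python
from tensorflow.keras import layers, models


def build_base(num_features):
    """Shared feature extractor without an output layer (sizes are assumptions)."""
    return models.Sequential([
        layers.Input(shape=(num_features, 1)),
        layers.Conv1D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
    ])


def attach_head(base, num_classes):
    """Put a freshly initialized softmax output layer on top of the base."""
    model = models.Sequential(
        [base, layers.Dense(num_classes, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model


def pretrain(base, groups, epochs=200):
    """Train on each group with a new output layer; the base weights carry over."""
    for features, labels, num_classes in groups:
        attach_head(base, num_classes).fit(
            features, labels, epochs=epochs, verbose=0)
    return base


def retrain(base, features, labels, num_terminals, epochs=200):
    """Fine-tune on labeled sessions from the current scenario."""
    model = attach_head(base, num_terminals)
    model.fit(features, labels, epochs=epochs, verbose=0)
    return model
```

Because every head wraps the same `base` object, discarding a head after each group corresponds to removing the output layer while keeping the learned feature extractor.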

6. Classify and mark traffic with the trained classifier, as follows:

a) Load the trained classifier.

b) Classify the collected traffic with the trained classifier.

c) Accept a classification result above the preset threshold as traffic of the corresponding terminal.

d) Mark the original traffic.
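The thresholding in step c) can be sketched as follows. The threshold value of 0.9, the function name and the choice to return `None` for rejected sessions are assumptions; the source only states that a preset threshold is used.

```python
def label_by_threshold(probabilities, terminal_ids, threshold=0.9):
    """Return the terminal id when the top probability exceeds the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] > threshold:
        return terminal_ids[best]
    # Below the threshold: leave the session unmarked.
    return None
```

For example, `label_by_threshold([0.05, 0.92, 0.03], ["A", "B", "C"])` accepts terminal "B", while a flat distribution such as `[0.4, 0.35, 0.25]` is rejected.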

In summary, this embodiment is based on CNN + LSTM and uses the idea of transfer learning. By learning the traffic statistical features and the traffic User-Agent features, it classifies traffic by terminal. It thereby solves the following technical problems of the prior art: the XGBoost algorithm uses decision trees that treat discrete features as continuous features, which causes large errors; the XGBoost algorithm converts a multi-class problem into several binary classification problems, which makes the model large and slow to compute; and real-time traffic classification is difficult to achieve.

Second embodiment

The embodiment provides a CNN + LSTM-based traffic terminal real-time identification device, which includes the following modules:

The CNN + LSTM-based traffic terminal real-time identification device of this embodiment corresponds to the CNN + LSTM-based traffic terminal real-time identification method of the first embodiment. The functions implemented by the functional modules of the device correspond one to one to the steps of the method, so they are not described again here.

Third embodiment

The present embodiment provides an electronic device, which includes a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.

The electronic device may vary considerably depending on its configuration or performance. It may include one or more processors (CPUs) and one or more memories, where the memory stores at least one instruction that is loaded by the processor to execute the method.

Fourth embodiment

The present embodiment provides a computer-readable storage medium, in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the method of the first embodiment. The computer readable storage medium may be, among others, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like. The instructions stored therein may be loaded by a processor in the terminal and perform the above-described method.

Furthermore, it should be noted that the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims

1. A CNN + LSTM-based flow terminal real-time identification method is characterized by comprising the following steps:

recombining a Transmission Control Protocol (TCP) session;

2. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein said reassembling of the TCP session of the transmission control protocol comprises:

3. The CNN + LSTM-based traffic terminal real-time identification method of claim 1, wherein the extracting traffic features from the session comprises:

extracting IP quintuple, IP-Time to live, IP-ID, TCP-Window Size, TCP-ISN and TCP-MSS information from the SYN packet of a TCP session; extracting User-Agent information from a message containing an HTTP request; counting the number of bytes of the uplink and downlink messages, the number of uplink and downlink messages, and the session duration based on the whole session;

processing the extracted User-Agent information into a device type, a device model, a browser type, a browser model, an operating system type and an operating system model using the user_agent package in Python.

4. The CNN + LSTM-based traffic terminal real-time identification method of claim 3, wherein the preprocessing the extracted traffic features comprises:

5. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein constructing a deep learning model combining CNN and LSTM includes:

and finally adding a full connection layer and an output layer.

6. The CNN + LSTM-based traffic terminal real-time identification method of claim 1, wherein training the model using the sample data set to obtain a classifier comprises:

loading a pre-training model;

adding output layers with the number equal to that of the terminals;

training the model for 200 rounds;

7. The CNN + LSTM-based traffic terminal real-time identification method according to claim 1, wherein the classifying and labeling the traffic using the trained classifier comprises:

loading the trained classifier;

8. A CNN + LSTM-based flow terminal real-time identification device is characterized by comprising: