CN115883263B - Encryption application protocol type identification method based on multi-scale load semantic mining - Google Patents

Encryption application protocol type identification method based on multi-scale load semantic mining

Info

Publication number
CN115883263B
CN115883263B
Authority
CN
China
Prior art keywords
features
sequence
load
scale
application protocol
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310189712.1A
Other languages
Chinese (zh)
Other versions
CN115883263A (en)
Inventor
吉庆兵
谈程
罗杰
潘炜
康璐
倪绿林
尹浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute filed Critical CETC 30 Research Institute
Priority to CN202310189712.1A priority Critical patent/CN115883263B/en
Publication of CN115883263A publication Critical patent/CN115883263A/en
Application granted granted Critical
Publication of CN115883263B publication Critical patent/CN115883263B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention provides an encryption application protocol type identification method based on multi-scale load semantic mining, which comprises the following steps: step 1, extracting payload features from the original traffic and converting them into a decimal byte sequence; step 2, constructing a pyramid neural network based on the load semantic mining block and processing the decimal byte sequence to obtain an input feature sequence; step 3, the load semantic mining block constructs a sliding window on the input feature sequence, the sliding window moves successively to the end of the sequence, and the features extracted in the windows are spliced to obtain the features of the input sequence; step 4, performing dimensionality reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3-4, and splicing the features obtained each time to obtain multi-scale features; and step 5, completing the classification of the encrypted network application protocol types according to the multi-scale features. The method can extract multi-scale features from encrypted network application protocol messages in complex scenarios and improves both the speed and the accuracy of encrypted traffic identification.

Description

Encryption application protocol type identification method based on multi-scale load semantic mining
Technical Field
The invention relates to the field of flow analysis, in particular to an encryption application protocol type identification method based on multi-scale load semantic mining.
Background
Traffic classification is the basis of network security and network management and has found very wide application, from QoS services at network service providers to the detection of security applications in firewalls and intrusion detection systems. At present, traffic classification mainly relies on methods based on port numbers, deep packet inspection, machine learning and the like, but each has certain disadvantages:
(1) Traditional port-number-based approaches have failed, since newer applications either use well-known port numbers to mask their traffic or do not use standard registered port numbers.
(2) Deep packet inspection relies on finding keywords in the packets, which fails in the face of encrypted traffic.
(3) Machine-learning-based encrypted network traffic identification methods rely heavily on manually engineered features, which limits their wider adoption.
With the popularity of deep learning methods, researchers have studied their effect on traffic classification tasks and demonstrated high accuracy on early mobile application traffic datasets. With the continuous upgrading of encryption protocols, the explosive growth in the number of mobile applications and the changes in their development patterns, shallow deep learning models can no longer meet the practical requirements of mobile application traffic identification in current complex scenarios. Although Transformer-based encrypted traffic identification methods perform well at feature learning, they focus mainly on global features during feature extraction and ignore the detail features hidden in high-resolution payload data, and in many cases these local features are the key to accurate classification.
Disclosure of Invention
In order to solve the problems that shallow neural networks cannot learn the deep features in encrypted traffic under current complex scenarios and that existing deep neural networks lose detail features by focusing excessively on global features, the invention provides a novel encrypted network application protocol type identification method.
The technical scheme adopted by the invention is as follows: the encryption application protocol type identification method based on multi-scale load semantic mining comprises the following steps:
step 1, preprocessing the original traffic of a mobile application encryption network, extracting the load characteristics of a transmission layer load, and converting the load characteristics into a decimal byte sequence;
step 2, constructing a pyramid neural network based on a load semantic mining block, and acquiring word embedding features and position coding features of a decimal byte sequence, wherein the word embedding features and the position coding features are added to obtain an input feature sequence;
step 3, a load semantic mining block constructs a sliding window on the input feature sequence, the sliding window sequentially moves until the tail end of the input sequence, the features in the sliding window are extracted when each movement is performed, and the features extracted in all the sliding windows are sequentially spliced to obtain the features of the input sequence;
step 4, performing feature compression and dimensionality reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3-4 k times, and splicing the features of the input sequence obtained each time to obtain the multi-scale features of the input sequence;
and step 5, completing the classification of the encrypted network application protocol types according to the multi-scale features.
Further, the preprocessing process in the step 1 is as follows:
step 1.1, dividing a data packet into session flows according to five-tuple;
step 1.2, cleaning the session stream, and removing data packets retransmitted over time, address resolution protocol and dynamic host configuration protocol;
step 1.3, extracting load characteristics of a transmission layer load in a data packet, and splicing the extracted load characteristics according to the arrival sequence of the data packet until the length of bytes after splicing reaches the set load characteristic length;
and step 1.4, converting the extracted spliced load characteristics into a decimal byte sequence.
Further, in the step 1.3, if the byte length after splicing the payload features of all the data packets in the session stream is still smaller than the set payload feature length, padding is performed with 0x00.
Further, in the step 2, the byte features of the decimal byte sequence are mapped into a d-dimensional vector space to obtain the word embedding features F1, $F1 \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the real numbers and N is the set payload feature length.
Further, in the step 2, the method for calculating the position coding feature is as follows:
$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d}\right)$ (1)

$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$ (2)

$F2 = \left[PE(1); PE(2); \ldots; PE(N)\right]$ (3)

where pos denotes the position at which a byte appears in the byte sequence; the left-hand side of formula (1), $PE_{(pos,2i)}$, is the position encoding of the bytes at even positions and the left-hand side of formula (2), $PE_{(pos,2i+1)}$, is the position encoding of the bytes at odd positions; i indexes the dimensions of the position encoding, the dimension subscript modulo 2 determining whether formula (1) (even, using the sine function) or formula (2) (odd, using the cosine function) applies; d is the dimension of the position encoding; F2 is the position-coding feature, and PE in formula (3) denotes the position encoding of each byte in the byte sequence.
Further, the substep of the step 3 includes:
step 3.1, constructing a sliding window with the length of L bytes on an input sequence;
step 3.2, extracting features of the data in the sliding window by adopting a multi-head attention mechanism to obtain features F4;
step 3.3, carrying out residual connection and layer normalization processing on the input sequence F3 and the characteristic F4 to obtain a characteristic F5;
step 3.4, performing two-layer full-connection layer operation on the feature F5 to obtain a feature F6;
step 3.5, carrying out residual connection and layer normalization processing on the characteristic F5 and the characteristic F6 to obtain a characteristic F7;
step 3.6, the sliding window moves backwards by L bytes, and the steps 3.2-3.6 are repeated until the sliding window moves to the tail end of the input sequence;
and 3.7, splicing the features F7 in all the sliding windows to obtain features F8 serving as features of the input sequence.
Further, the substeps of the step 3.2 are as follows:
step 3.2.1, performing multi-head self-attention calculation on the data in the sliding window, and extracting the association relation of byte sequences in the window;
and 3.2.2, repeating the step 3.2.1 M times according to the set number of attention heads M, and splicing the M extracted results and applying a linear transformation to obtain the features F4 of the data in the sliding window.
Further, in the step 4, the feature compression and dimension reduction are completed by adopting a one-dimensional maximum pooling layer, and each pooling operation halves the dimension of the first dimension of the feature.
Further, the substep of step 5 includes:
step 5.1, inputting the extracted multi-scale features into a full-connection layer and an activation function, wherein the output dimension is consistent with the number of flow categories;
and 5.2, calculating the category of the encrypted network application protocol type according to the output.
Further, in the step 5.2, the specific calculation of the category is:

$\mathrm{class} = \arg\max(Z)$

where Z represents the output obtained by feeding the multi-scale features through the fully connected layer and the activation function.
Compared with the prior art, the beneficial effects of adopting the technical scheme are as follows:
1. The pyramid network constructed from load semantic mining blocks can extract multi-scale features from encrypted network application protocol messages in current complex scenarios, fully extracting both global features and multi-scale local features, thereby improving the accuracy of encrypted traffic identification.
2. When local features are extracted, a sliding window is adopted and each self-attention calculation is performed within the coverage of a window, which prevents noise from being introduced during local feature extraction, greatly reduces the number of model parameters, and improves the calculation speed of the model.
3. The method learns and classifies based on the transport layer payload data in the network traffic and does not depend on the IP address and port number information in the network traffic packet header, so the classification model generalizes well; strong identifiers such as the IP address and port number in the packet header are not universal and may strongly interfere with the final identification result.
Drawings
Fig. 1 is a flowchart of an encryption application protocol type identification method based on multi-scale load semantic mining.
Fig. 2 is a schematic diagram of a pyramid network model according to an embodiment of the invention.
FIG. 3 is a flow chart of a sliding window implementation in an embodiment of the invention.
FIG. 4 is a schematic diagram of multi-scale feature extraction in accordance with one embodiment of the invention.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar modules or modules having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the present application include all alternatives, modifications, and equivalents as may be included within the spirit and scope of the appended claims.
Aiming at the problems that a shallow neural network cannot learn deep features in encrypted traffic under the current complex scene and detail features are lost due to the fact that the existing deep neural network pays attention to global features excessively, the embodiment provides an encryption application protocol type identification method for extracting multi-scale features based on load semantic mining deep neural network.
As shown in fig. 1, the encryption application protocol type identification method based on multi-scale load semantic mining includes:
step 1, preprocessing the original traffic of a mobile application encryption network, extracting the load characteristics of a transmission layer load, and converting the load characteristics into a decimal byte sequence;
step 2, constructing a pyramid neural network based on the load semantic mining block; acquiring word embedding characteristics and position coding characteristics of a decimal byte sequence, and adding the word embedding characteristics and the position coding characteristics to obtain an input characteristic sequence;
step 3, a load semantic mining block constructs a sliding window on the input feature sequence, the sliding window sequentially moves until the tail end of the input sequence, the features in the sliding window are extracted when each movement is performed, and the features extracted in all the sliding windows are sequentially spliced to obtain the features of the input sequence;
step 4, performing feature compression and dimensionality reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3-4 k times, and splicing the features of the input sequence obtained each time to obtain the multi-scale features of the input sequence;
and step 5, completing the classification of the encrypted network application protocol types according to the multi-scale features.
Since strong identifiers such as the IP address and port number information in the network traffic packet header are not universal and may strongly interfere with the identification result, in this embodiment learning and classification are performed based on the payload data at the transport layer of the network traffic, without relying on the IP address, port number, or other header information of the network traffic packets.
Before parsing, the original flow needs to be preprocessed, specifically:
step 1.1, dividing the received data packet into session flows according to five-tuple (source IP, destination IP, source port, destination port, transport layer protocol), and identifying the flows by taking the session flows as units.
In step 1.2, the received packets include packets that are unrelated to the transmitted content, so the session stream needs to be cleaned: packets retransmitted after a timeout, Address Resolution Protocol (ARP) packets, and Dynamic Host Configuration Protocol (DHCP) packets are removed. In this embodiment, the cleaning is accomplished using the Wireshark Tshark tool.
And step 1.3, after the irrelevant data packets are removed, the payload features of the transport layer payload of the remaining data packets are extracted and spliced according to the arrival order of the data packets, until the length of the extracted bytes reaches the set payload feature length N. It should be noted that, in this embodiment, if the byte length after splicing the payload features of all the packets in the session stream is smaller than N, 0x00 is used for padding.
Preferably, the load characteristics of the transport layer load are extracted by using the rdpcap method of the Scapy tool in this embodiment.
And 1.4, converting the extracted and spliced binary load characteristics into a decimal byte sequence, namely converting each byte into a corresponding decimal number (0-255).
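For illustration, the preprocessing of steps 1.1-1.4 can be sketched in Python as follows. This is a minimal sketch rather than the patented tool chain: the flow_key helper, the payload-feature length N = 784, and the lack of bidirectional flow merging are illustrative assumptions, and Tshark-based removal of ARP/DHCP and retransmitted packets is assumed to have been done beforehand.

```python
# Minimal preprocessing sketch (assumed helper names, not the patented tool chain):
# read a pcap with Scapy's rdpcap, group packets into session flows by five-tuple,
# splice transport-layer payloads in arrival order, truncate/pad to N bytes with 0x00,
# and return each flow as a decimal byte sequence (values 0-255).
from collections import defaultdict
from scapy.all import rdpcap
from scapy.layers.inet import IP, TCP, UDP

N = 784  # assumed payload-feature length; the patent only requires a fixed N

def flow_key(pkt):
    """Five-tuple (src IP, dst IP, src port, dst port, transport protocol)."""
    layer4 = TCP if pkt.haslayer(TCP) else UDP
    return (pkt[IP].src, pkt[IP].dst, pkt[layer4].sport, pkt[layer4].dport, layer4.__name__)

def extract_flows(pcap_path):
    flows = defaultdict(bytearray)
    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(IP) or not (pkt.haslayer(TCP) or pkt.haslayer(UDP)):
            continue  # irrelevant packets are dropped here or by Tshark cleaning beforehand
        payload = bytes(pkt[TCP].payload) if pkt.haslayer(TCP) else bytes(pkt[UDP].payload)
        if payload:
            flows[flow_key(pkt)].extend(payload)  # splice payloads in arrival order
    # truncate to N bytes, pad short flows with 0x00, convert to decimal byte sequences
    return {k: list(v[:N]) + [0x00] * max(0, N - len(v)) for k, v in flows.items()}
```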
After obtaining the decimal byte sequence representing the transmission layer characteristics, the analysis of the traffic class can be started, and in this embodiment, features of different scales in the load (decimal byte sequence) are extracted by using the constructed Pyramid-shaped neural network (Pyramid-Transformer).
Current Transformer-based (the Transformer is a deep learning architecture) encrypted traffic recognition models use the self-attention mechanism and pay more attention to extracting global features while neglecting the extraction of local features, even though local features may be the key to fine-grained classification; at the same time, the local features may have inconsistent scales, which can introduce interference during extraction.
As shown in fig. 2 and fig. 4, in step 2 of this embodiment a pyramid-shaped neural network (Pyramid-Transformer) is constructed from a plurality of load semantic mining blocks (Pyramid Transformer blocks), with a one-dimensional max pooling layer arranged between the blocks to realize compression and dimensionality reduction during feature extraction. Each load semantic mining block has the same composition and consists of six sequentially connected parts: multi-head attention calculation, residual connection, layer normalization, two fully connected layers with an activation function, residual connection, and layer normalization. Deep multi-scale features are extracted by stacking several load semantic mining blocks: after each block extracts its features, the feature dimension is compressed to 1/2 and the compressed features are fed into the next block while the window size stays unchanged, so that features of larger scale are extracted; the feature dimensions extracted by successive blocks decrease, forming a pyramid shape, and the features are spliced to obtain the final features.
The feature extraction process of the pyramid neural network is explained in detail below:
Feature extraction is mainly completed by the load semantic mining blocks in the pyramid neural network, and the input of a load semantic mining block is the combination of the word embedding features and the position-coding features of the byte sequence, so the decimal byte sequence needs to be processed first.
A word embedding operation is performed on the byte sequence (denoted B1, B2, ..., BN-1, BN in figures 2 and 4), mapping the byte features into a d-dimensional vector space to obtain the word embedding features F1 as the subsequent input, $F1 \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the real numbers.
The position-coding features F2 of the byte sequence are computed, $F2 \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the real numbers:

$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d}\right)$ (1)

$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$ (2)

$F2 = \left[PE(1); PE(2); \ldots; PE(N)\right]$ (3)

where pos denotes the position at which a byte appears in the byte sequence; the left-hand side of formula (1), $PE_{(pos,2i)}$, is the position encoding of the bytes at even positions and the left-hand side of formula (2), $PE_{(pos,2i+1)}$, is the position encoding of the bytes at odd positions; i indexes the dimensions of the position encoding, the dimension subscript modulo 2 determining whether formula (1) (even, using the sine function) or formula (2) (odd, using the cosine function) applies; d is the dimension of the position encoding; PE in formula (3) denotes the position encoding of each byte in the byte sequence. Because the Transformer uses global information, it cannot by itself exploit byte-order information, which is important for feature learning, so this embodiment acquires position-coding features.
The word embedding features and the position-coding features are combined according to formula (4) to obtain the input features F3 of the load semantic mining block, $F3 \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the real numbers:

$F3 = F1 + F2$ (4)
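A minimal PyTorch sketch of this input construction is given below. It assumes the standard Transformer sinusoidal encoding with base 10000 for formulas (1)-(3), an even embedding dimension d = 128, and a sequence length N = 784; these values are illustrative choices rather than values fixed by the patent.

```python
# Build the input features F3 = F1 + F2 from a decimal byte sequence:
# F1 is a learned d-dimensional embedding of each byte value (0-255),
# F2 is the sinusoidal position encoding (sin on even dims, cos on odd dims).
import torch
import torch.nn as nn

def positional_encoding(n: int, d: int) -> torch.Tensor:
    """F2: sine on even dimensions, cosine on odd dimensions, shape (n, d)."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)      # (n, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)               # even dimension indices
    angle = pos / torch.pow(torch.tensor(10000.0), i / d)        # (n, d/2)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class ByteInputEncoder(nn.Module):
    def __init__(self, d: int = 128, n: int = 784):
        super().__init__()
        self.embed = nn.Embedding(256, d)                        # word embedding F1
        self.register_buffer("pe", positional_encoding(n, d))    # position encoding F2

    def forward(self, byte_seq: torch.Tensor) -> torch.Tensor:
        # byte_seq: (batch, n) integers in 0..255  ->  F3: (batch, n, d)
        return self.embed(byte_seq) + self.pe

# usage: F3 = ByteInputEncoder()(torch.randint(0, 256, (2, 784)))
```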
After the input of the load semantic mining block is determined, the feature extraction can be performed through the load semantic mining block, and the method specifically comprises the following steps:
In step 3.1, because some detail features exist only over a small number of adjacent bytes, extracting features directly over the whole input sequence may interfere with the local detail features, so a sliding window is used to ensure that the high-resolution local detail features are not destroyed. A sliding window of size L is therefore constructed on the input features F3 and, as shown in fig. 3, feature extraction is performed on the data inside the window.
Step 3.2, acquiring the data in the sliding window as
Figure SMS_31
,/>
Figure SMS_32
Adopts a multi-head attention mechanism pair +.>
Figure SMS_33
Extracting features to obtain features->
Figure SMS_34
,/>
Figure SMS_35
,/>
Figure SMS_36
The global dependency of the bytes within the window is contained, whereas the view of the entire byte sequence is obtained here as a local feature within the window. />
The specific process comprises the following steps:
Step 3.2.1, multi-head self-attention is computed on X, extracting the association relations of the byte sequence within the window, as follows:

Using the weight matrices $W^Q$, $W^K$ and $W^V$, the query, key and value features Q, K and V of X are computed as shown in formulas (5), (6) and (7):

$Q = X W^Q$ (5)

$K = X W^K$ (6)

$V = X W^V$ (7)

Matrix operations on Q, K and V implement the self-attention mechanism (Attention), giving the output Z, $Z \in \mathbb{R}^{L \times d}$:

$Z = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$ (8)
where $d_k$ is the number of columns of the Q and K matrices, i.e. the vector dimension, the same as d, and $K^{T}$ is the transpose of the matrix K. The inner product of each row vector of Q with each row vector of K is computed and divided by $\sqrt{d_k}$. Multiplying Q by the transpose of K yields a matrix with L rows and L columns, where L is the window size; this matrix expresses the association strength between the bytes. Once $Q K^{T} / \sqrt{d_k}$ has been obtained, the softmax function (normalized exponential function) computes the self-attention coefficient of each byte with respect to the other bytes; softmax normalizes each row of the matrix so that the entries of each row sum to 1.
Step 3.2.2, setting the number M of attention heads, repeating the step 3.2.1M times to obtain M output Z, and splicing and linearly transforming the M Z to obtain the characteristic
Figure SMS_60
,/>
Figure SMS_61
Figure SMS_62
Wherein,
Figure SMS_63
output representing the first calculation, +.>
Figure SMS_64
Indicate->
Figure SMS_65
Output of the secondary calculation, +.>
Figure SMS_66
Weight matrix representing a linear transformation, +.>
Figure SMS_67
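The windowed multi-head self-attention of formulas (5)-(8) and step 3.2.2 can be sketched as follows. The sketch assumes the usual convention of a per-head dimension $d_k = d/M$ (so the concatenated heads already have width d before $W^O$), which may differ from the exact per-head shapes intended in the patent.

```python
# Multi-head self-attention over the L bytes of one sliding window:
# each head computes softmax(QK^T / sqrt(d_k)) V, the M head outputs are
# concatenated and linearly transformed by W^O into F4 of shape (L, d).
import math
import torch
import torch.nn as nn

class WindowMultiHeadAttention(nn.Module):
    def __init__(self, d: int = 128, num_heads: int = 4):
        super().__init__()
        assert d % num_heads == 0
        self.h, self.d_k = num_heads, d // num_heads
        self.w_q = nn.Linear(d, d, bias=False)   # stacked per-head W^Q
        self.w_k = nn.Linear(d, d, bias=False)   # stacked per-head W^K
        self.w_v = nn.Linear(d, d, bias=False)   # stacked per-head W^V
        self.w_o = nn.Linear(d, d, bias=False)   # W^O applied to the concatenated heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d) bytes of one sliding window -> F4: (batch, L, d)
        b, L, d = x.shape
        def split(t):  # (batch, L, d) -> (batch, heads, L, d_k)
            return t.view(b, L, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (batch, heads, L, L) association strengths
        z = torch.softmax(scores, dim=-1) @ v                    # per-head self-attention output Z
        z = z.transpose(1, 2).reshape(b, L, d)                   # concatenate the M heads
        return self.w_o(z)                                       # linear transformation -> F4

# usage: F4 = WindowMultiHeadAttention()(torch.randn(2, 16, 128))
```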
Step 3.3, pair
Figure SMS_68
And->
Figure SMS_69
Performing residual connection and layer normalization operation to obtain characteristic ∈>
Figure SMS_70
Figure SMS_71
(9)
Wherein LayerNorm represents the layer normalization operation.
Step 3.4, pair
Figure SMS_72
Performing a Forward propagation (Feed Forward) operation to obtain the characteristic +.>
Figure SMS_73
,/>
Figure SMS_74
Figure SMS_75
(10)
Wherein, linear represents performing a full-connection layer operation; feed Forward consists of two fully connected layers, the first layer using an activation function RELU and the second layer not using an activation function.
Step 3.5, pair
Figure SMS_76
And->
Figure SMS_77
Performing residual connection and layer normalization operation to obtain characteristic ∈>
Figure SMS_78
,/>
Figure SMS_79
Figure SMS_80
(11)
Step 3.6, moving the sliding window backwards by L bytes, and re-executing the steps 3.2-3.5 in the new window until the sliding window moves to the input feature
Figure SMS_81
Ending;
Step 3.7, the features F7 obtained in each sliding window are spliced to obtain F8, $F8 \in \mathbb{R}^{N \times d}$:

$F8 = \mathrm{Concat}(F7^{(1)}, F7^{(2)}, \ldots, F7^{(n)})$ (12)

where $F7^{(1)}$ denotes the features obtained in the first window, $F7^{(n)}$ denotes the features obtained in the last window, and n is the number of windows.
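A compact sketch of one load semantic mining block (steps 3.1-3.7) is shown below. It assumes non-overlapping windows of L bytes, a sequence length divisible by L, and uses torch.nn.MultiheadAttention as a stand-in for the windowed multi-head attention of step 3.2; the actual block may differ in these details.

```python
# One load semantic mining block: for each L-byte window apply multi-head
# attention (step 3.2), Add&Norm (eq. 9), the two-layer feed-forward (eq. 10),
# Add&Norm (eq. 11), then concatenate the per-window features F7 into F8 (eq. 12).
import torch
import torch.nn as nn

class LoadSemanticMiningBlock(nn.Module):
    def __init__(self, d: int = 128, num_heads: int = 4, window: int = 16, d_ff: int = 512):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) input features; n is assumed to be a multiple of the window size
        outputs = []
        for start in range(0, x.size(1), self.window):        # slide the window by L bytes
            w = x[:, start:start + self.window, :]
            f4, _ = self.attn(w, w, w)                        # step 3.2: multi-head attention
            f5 = self.norm1(w + f4)                           # step 3.3: residual + LayerNorm, eq. (9)
            f6 = self.ffn(f5)                                 # step 3.4: two fully connected layers, eq. (10)
            f7 = self.norm2(f5 + f6)                          # step 3.5: residual + LayerNorm, eq. (11)
            outputs.append(f7)
        return torch.cat(outputs, dim=1)                      # step 3.7: splice window features -> F8

# usage: F8 = LoadSemanticMiningBlock()(torch.randn(2, 784, 128))
```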
In order to extract the multi-scale features of the byte sequence, in step 4 of this embodiment a one-dimensional max pooling layer is first applied to the features F8 for feature compression and dimensionality reduction, obtaining the features F9, $F9 \in \mathbb{R}^{(N/2) \times d}$:

$F9 = \mathrm{MaxPool1d}(F8)$ (13)

where MaxPool1d denotes the one-dimensional max pooling operation; each pooling operation halves the first dimension of the features, while the new features carry richer semantic information.
The number of repetitions k is set as required, and steps 3-4 are repeated k times; apart from the first execution of step 3, which takes the input features F3, each subsequent execution of step 3 takes the features F9 obtained in the preceding step 4 as its input.
The features F8 obtained in each repeated execution are spliced to obtain the multi-scale features $F_{ms}$:

$F_{ms} = \mathrm{Concat}(F8_1, F8_2, \ldots, F8_k)$ (14)
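The pyramid stacking of step 4 and formulas (13)-(14) can be sketched as follows. The block factory and k = 3 are illustrative, and nn.Identity is used only as a placeholder for the load semantic mining block sketched above.

```python
# Pyramid stacking: k rounds, each applying a mining block and then MaxPool1d to
# halve the sequence length, with the per-round features spliced along the
# sequence dimension into the multi-scale features of eq. (14).
import torch
import torch.nn as nn

class PyramidFeatureExtractor(nn.Module):
    def __init__(self, make_block=lambda: nn.Identity(), k: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([make_block() for _ in range(k)])
        self.pool = nn.MaxPool1d(kernel_size=2)   # halves the first (sequence) dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d) input features F3 -> multi-scale features (batch, N + N/2 + ..., d)
        scales = []
        for block in self.blocks:
            f8 = block(x)                                        # features at the current scale
            scales.append(f8)
            x = self.pool(f8.transpose(1, 2)).transpose(1, 2)    # eq. (13): compress, becomes the new input
        return torch.cat(scales, dim=1)                          # eq. (14): splice the multi-scale features

# usage: F_ms = PyramidFeatureExtractor(k=3)(torch.randn(2, 784, 128))
```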
As shown in fig. 4, the repeated operation corresponds to stacking the load semantic mining blocks of the pyramid network model multiple times, progressively extracting deeper features with richer semantics layer by layer. In fig. 4 the feature dimensions are denoted by N and d, where N is the same as the length of the input byte sequence and d is the same as the dimension to which each byte is expanded after the word embedding operation. $F8_1 \in \mathbb{R}^{N \times d}$ is the feature of the first repetition and, since the first dimension is halved before each further repetition, $F8_k$ is the feature of the k-th operation.
The features $F_{ms}$ obtained at this point are exactly the desired multi-scale features of the payload. Once the multi-scale features have been obtained, traffic classification can be performed:
in this embodiment, the classification process specifically includes:
Step 5.1, the extracted multi-scale features $F_{ms}$ are fed into a fully connected layer and an activation function, and the output dimension is consistent with the number of traffic categories C:

$Z = f\left(F_{ms} W\right)$ (15)

where W denotes the weight matrix of the fully connected layer, f denotes the activation function, and Z is the resulting output, whose dimension equals the number of traffic categories C.
Step 5.2, calculating and outputting the category of the encrypted network application protocol type:
Figure SMS_111
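A minimal sketch of the classification head of step 5 follows. It assumes the multi-scale features are flattened before the fully connected layer and that the activation function is a softmax over the C traffic categories, with the predicted class taken as the argmax of Z; the feature length 1372 and the 12 categories in the usage line are illustrative only.

```python
# Classification head: fully connected layer whose output dimension equals the
# number of traffic categories, softmax activation, and argmax for the category.
import torch
import torch.nn as nn

class ProtocolClassifier(nn.Module):
    def __init__(self, feat_len: int, d: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_len * d, num_classes)   # output dimension = number of traffic categories

    def forward(self, f_ms: torch.Tensor) -> torch.Tensor:
        # f_ms: (batch, feat_len, d) multi-scale features -> Z: (batch, C) class probabilities
        return torch.softmax(self.fc(f_ms.flatten(1)), dim=-1)

# predicted category: the class index with the largest probability
# pred = ProtocolClassifier(1372, 128, 12)(F_ms).argmax(dim=-1)
```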
the embodiment constructs a deep neural network, namely a pyramid neural network, and stacks the load semantic mining blocks, so that deep features in the type of the encryption protocol message in the current complex scene can be extracted, and the accuracy of flow identification is improved.
It should be noted that, in the description of the embodiments of the present invention, unless explicitly specified and limited otherwise, the terms "disposed," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; may be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present invention will be understood in detail by those skilled in the art; the accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (9)

1. The encryption application protocol type identification method based on multi-scale load semantic mining is characterized by comprising the following steps:
step 1, preprocessing the original traffic of a mobile application encryption network, extracting the load characteristics of a transmission layer load, and converting the load characteristics into a decimal byte sequence;
step 2, constructing a pyramid neural network based on a load semantic mining block, and acquiring word embedding features and position coding features of a decimal byte sequence, wherein the word embedding features and the position coding features are added to obtain an input feature sequence;
step 3, a load semantic mining block constructs a sliding window on the input feature sequence, the sliding window sequentially moves until the tail end of the input sequence, the features in the sliding window are extracted when each movement is performed, and the features extracted in all the sliding windows are sequentially spliced to obtain the features of the input sequence;
step 4, performing feature compression and dimensionality reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3-4 k times, and splicing the features of the input sequence obtained in each repeated step 3 to obtain the multi-scale features of the input sequence;
step 5, completing classification of the encrypted network application protocol types according to the multi-scale characteristics;
the substep of the step 3 comprises the following steps:
step 3.1, constructing a sliding window with a length of L bytes on an input characteristic sequence;
step 3.2, extracting features of the data in the sliding window by adopting a multi-head attention mechanism to obtain features F4;
step 3.3, carrying out residual connection and layer normalization processing on the input sequence F3 and the characteristic F4 to obtain a characteristic F5;
step 3.4, performing two-layer full-connection layer operation on the feature F5 to obtain a feature F6;
step 3.5, carrying out residual connection and layer normalization processing on the characteristic F5 and the characteristic F6 to obtain a characteristic F7;
step 3.6, the sliding window moves backwards by L bytes, and the steps 3.2-3.6 are repeated until the sliding window moves to the tail end of the input sequence;
and 3.7, splicing the features F7 in all the sliding windows to obtain features F8 serving as features of the input sequence.
2. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, wherein the preprocessing process in step 1 is as follows:
step 1.1, dividing a data packet into session flows according to five-tuple;
step 1.2, cleaning the session stream, and removing data packets retransmitted over time, address resolution protocol and dynamic host configuration protocol;
step 1.3, extracting load characteristics of a transmission layer load in a data packet, and splicing the extracted load characteristics according to the arrival sequence of the data packet until the length of bytes after splicing reaches the set load characteristic length;
and step 1.4, converting the extracted spliced load characteristics into a decimal byte sequence.
3. The encryption application protocol type recognition method based on multi-scale payload semantic mining according to claim 2, wherein in the step 1.3, if the byte length after splicing the payload features of all the data packets in the session stream is still smaller than the set payload feature length, padding is performed with 0x00.
4. The encryption application protocol type recognition method based on multi-scale payload semantic mining according to claim 1 or 2, wherein in the step 2, the byte features of the decimal byte sequence are mapped into a d-dimensional vector space to obtain the word embedding features F1, $F1 \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the real numbers.
5. The encryption application protocol type recognition method based on multi-scale load semantic mining according to claim 4, wherein in the step 2, the position coding feature calculation method is as follows:
$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d}\right)$ (1)

$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$ (2)

$F2 = \left[PE(1); PE(2); \ldots; PE(N)\right]$ (3)

where pos denotes the position at which a byte appears in the byte sequence; the left-hand side of formula (1), $PE_{(pos,2i)}$, is the position encoding of the bytes at even positions and the left-hand side of formula (2), $PE_{(pos,2i+1)}$, is the position encoding of the bytes at odd positions; i indexes the dimensions of the position encoding, the dimension subscript modulo 2 determining whether formula (1) (even, using the sine function) or formula (2) (odd, using the cosine function) applies; d is the dimension of the position encoding; F2 is the position-coding feature, and PE in formula (3) denotes the position encoding of each byte in the byte sequence.
6. The encryption application protocol type identification method based on multi-scale payload semantic mining according to claim 1, wherein the substeps of step 3.2 are:
step 3.2.1, performing multi-head self-attention calculation on the data in the sliding window, and extracting the association relation of byte sequences in the window;
and 3.2.2, repeating the step 3.2.1 M times according to the set number of attention heads M, and splicing the M extracted results and applying a linear transformation to obtain the features F4 of the data in the sliding window.
7. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, wherein in the step 4, feature compression and dimension reduction are completed by adopting a one-dimensional maximum pooling layer, and each pooling operation halves the dimension of a first dimension of a feature.
8. The encryption application protocol type recognition method based on multi-scale payload semantic mining according to claim 1, wherein the substep of step 5 includes:
step 5.1, inputting the extracted multi-scale features into a full-connection layer and an activation function, wherein the output dimension is consistent with the number of flow categories;
and 5.2, calculating the category of the encrypted network application protocol type according to the output.
9. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 8, wherein in the step 5.2, the specific calculation method of the category is:
$\mathrm{class} = \arg\max(Z)$

where class represents the category and Z represents the output obtained by feeding the multi-scale features through the fully connected layer and the activation function.
CN202310189712.1A 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining Active CN115883263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310189712.1A CN115883263B (en) 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310189712.1A CN115883263B (en) 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining

Publications (2)

Publication Number Publication Date
CN115883263A CN115883263A (en) 2023-03-31
CN115883263B true CN115883263B (en) 2023-05-09

Family

ID=85761794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310189712.1A Active CN115883263B (en) 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining

Country Status (1)

Country Link
CN (1) CN115883263B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104052749A (en) * 2014-06-23 2014-09-17 中国科学技术大学 Method for identifying link-layer protocol data types
CN104506484A (en) * 2014-11-11 2015-04-08 中国电子科技集团公司第三十研究所 Proprietary protocol analysis and identification method
CN105430021A (en) * 2015-12-31 2016-03-23 中国人民解放军国防科学技术大学 Encrypted traffic identification method based on load adjacent probability model
EP3111612A1 (en) * 2014-02-28 2017-01-04 British Telecommunications Public Limited Company Profiling for malicious encrypted network traffic identification
CN110532564A (en) * 2019-08-30 2019-12-03 中国人民解放军陆军工程大学 A kind of application layer protocol online recognition method based on CNN and LSTM mixed model
CN111211948A (en) * 2020-01-15 2020-05-29 太原理工大学 Shodan flow identification method based on load characteristics and statistical characteristics
CN112163594A (en) * 2020-08-28 2021-01-01 南京邮电大学 Network encryption traffic identification method and device
CN112511555A (en) * 2020-12-15 2021-03-16 中国电子科技集团公司第三十研究所 Private encryption protocol message classification method based on sparse representation and convolutional neural network
WO2022094926A1 (en) * 2020-11-06 2022-05-12 中国科学院深圳先进技术研究院 Encrypted traffic identification method, and system, terminal and storage medium
CN115277888A (en) * 2022-09-26 2022-11-01 中国电子科技集团公司第三十研究所 Method and system for analyzing message type of mobile application encryption protocol
CN115348215A (en) * 2022-07-25 2022-11-15 南京信息工程大学 Encrypted network flow classification method based on space-time attention mechanism
CN115348198A (en) * 2022-10-19 2022-11-15 中国电子科技集团公司第三十研究所 Unknown encryption protocol identification and classification method, device and medium based on feature retrieval

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107637041B (en) * 2015-03-17 2020-09-29 英国电讯有限公司 Method and system for identifying malicious encrypted network traffic and computer program element
CN113949653B (en) * 2021-10-18 2023-07-07 中铁二院工程集团有限责任公司 Encryption protocol identification method and system based on deep learning
CN114358118A (en) * 2021-11-29 2022-04-15 南京邮电大学 Multi-task encrypted network traffic classification method based on cross-modal feature fusion

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3111612A1 (en) * 2014-02-28 2017-01-04 British Telecommunications Public Limited Company Profiling for malicious encrypted network traffic identification
CN104052749A (en) * 2014-06-23 2014-09-17 中国科学技术大学 Method for identifying link-layer protocol data types
CN104506484A (en) * 2014-11-11 2015-04-08 中国电子科技集团公司第三十研究所 Proprietary protocol analysis and identification method
CN105430021A (en) * 2015-12-31 2016-03-23 中国人民解放军国防科学技术大学 Encrypted traffic identification method based on load adjacent probability model
CN110532564A (en) * 2019-08-30 2019-12-03 中国人民解放军陆军工程大学 A kind of application layer protocol online recognition method based on CNN and LSTM mixed model
CN111211948A (en) * 2020-01-15 2020-05-29 太原理工大学 Shodan flow identification method based on load characteristics and statistical characteristics
CN112163594A (en) * 2020-08-28 2021-01-01 南京邮电大学 Network encryption traffic identification method and device
WO2022041394A1 (en) * 2020-08-28 2022-03-03 南京邮电大学 Method and apparatus for identifying network encrypted traffic
WO2022094926A1 (en) * 2020-11-06 2022-05-12 中国科学院深圳先进技术研究院 Encrypted traffic identification method, and system, terminal and storage medium
CN112511555A (en) * 2020-12-15 2021-03-16 中国电子科技集团公司第三十研究所 Private encryption protocol message classification method based on sparse representation and convolutional neural network
CN115348215A (en) * 2022-07-25 2022-11-15 南京信息工程大学 Encrypted network flow classification method based on space-time attention mechanism
CN115277888A (en) * 2022-09-26 2022-11-01 中国电子科技集团公司第三十研究所 Method and system for analyzing message type of mobile application encryption protocol
CN115348198A (en) * 2022-10-19 2022-11-15 中国电子科技集团公司第三十研究所 Unknown encryption protocol identification and classification method, device and medium based on feature retrieval

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jinhai Zhang. Research on Key Technology of VPN Protocol Recognition. 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), 2019, pp. 161-164. *
刘帅. Research and Implementation of Encrypted Traffic Identification Based on Machine Learning (in Chinese). China Master's Theses Full-text Database, Information Science and Technology, 2021, I139-28. *

Also Published As

Publication number Publication date
CN115883263A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN112163594B (en) Network encryption traffic identification method and device
JP4456554B2 (en) Data compression method and compressed data transmission method
CN104036012B (en) Dictionary learning, vision bag of words feature extracting method and searching system
CN112511555A (en) Private encryption protocol message classification method based on sparse representation and convolutional neural network
CN109818930B (en) Communication text data transmission method based on TCP protocol
CN113179223A (en) Network application identification method and system based on deep learning and serialization features
CN108462707B (en) Mobile application identification method based on deep learning sequence analysis
CN113313156A (en) Internet of things equipment identification method and system based on time sequence load flow fingerprints
EP3716547A1 (en) Data stream recognition method and apparatus
CN113780447A (en) Sensitive data discovery and identification method and system based on flow analysis
CN116192523A (en) Industrial control abnormal flow monitoring method and system based on neural network
CN112887291A (en) I2P traffic identification method and system based on deep learning
CN104463922B (en) A kind of characteristics of image coding and recognition methods based on integrated study
CN115883263B (en) Encryption application protocol type identification method based on multi-scale load semantic mining
CN112383488B (en) Content identification method suitable for encrypted and non-encrypted data streams
CN115248924A (en) Two-dimensional code processing method and device, electronic equipment and storage medium
CN108563795B (en) Pairs method for accelerating matching of regular expressions of compressed flow
CN104767998B (en) A kind of visual signature coding method and device towards video
CN108573069B (en) Twins method for accelerating matching of regular expressions of compressed flow
CN114553790A (en) Multi-mode feature-based small sample learning Internet of things traffic classification method and system
CN113852605A (en) Protocol format automatic inference method and system based on relational reasoning
CN115473850A (en) Real-time data filtering method and system based on AI and storage medium
US20070050489A1 (en) Method to Exchange Objects Between Object-Oriented and Non-Object-Oriented Environments
CN114048799A (en) Zero-day traffic classification method based on statistical information and payload coding
JP4456574B2 (en) Compressed data transmission method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant