CN109951464B - Message sequence clustering method for unknown binary private protocol - Google Patents

Message sequence clustering method for unknown binary private protocol

Info

Publication number
CN109951464B
Authority
CN
China
Prior art keywords
message sequence
vector
clustering
word
message
Prior art date
Legal status
Active
Application number
CN201910173504.6A
Other languages
Chinese (zh)
Other versions
CN109951464A (en)
Inventor
杨超
吴继超
Current Assignee
Nanjing Cyber Peace Technology Co Ltd
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910173504.6A
Publication of CN109951464A
Application granted
Publication of CN109951464B
Legal status: Active
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a message sequence clustering method for an unknown binary private protocol, which mainly solves the prior-art problem that the similarity between protocol message sequences cannot be accurately measured during protocol reverse engineering. The implementation scheme is as follows: 1) collect an unknown binary private protocol message sequence; 2) preprocess the collected message sequence; 3) extract multi-scale N-gram features of the preprocessed message sequence; 4) reduce the dimension of the multi-scale N-gram features by variance-based selection; 5) build the embedded representation of the message sequence from the dimension-reduced multi-scale N-gram features; 6) determine the optimal cluster number K from the embedded representation of the message sequences; 7) cluster the message sequences according to the optimal cluster number K. The invention fully mines the latent semantic information of message sequences, can accurately measure the similarity between message sequences, improves clustering accuracy, and can be used for clustering unknown binary private protocols.

Description

Message sequence clustering method for unknown binary private protocol
Technical Field
The invention belongs to the technical field of information, and further relates to a message sequence clustering method which can be used for clustering unknown binary private protocols.
Background
Network protocols are specifications for the communication of entities in a network: they specify the data format and the associated synchronization when communicating entities exchange information with one another. In addition to standardized communication protocols, networks carry a large number of unknown proprietary protocols. Message sequence clustering is the first task in the protocol reverse-engineering process: it separates the messages of mixed private-protocol message sequences into their types as far as possible according to the similarity between message sequences, after which field format inference and state machine inference are performed.
The core problem of private protocol message sequence clustering, i.e. network protocol identification, is how to accurately measure the similarity between message sequences. Current message sequence clustering algorithms for unknown private protocols can be roughly divided into three categories: edit-distance-based, keyword-based, and probability-model-based sequence clustering. Edit distance measures the similarity between sequences by the minimum number of operations, insertion, deletion, and substitution of a character, required to change one string into another. The edit distance algorithm is similar in spirit to the longest-common-subsequence idea of the Needleman-Wunsch algorithm; viewed as text matching, both ignore local features between sequences, yet these local features, namely protocol keywords, may be the key to measuring inter-sequence similarity in protocol clustering. Probability-model-based sequence clustering is difficult to model and is effective only for clustering long sequences. The representative keyword-based sequence clustering algorithm is the Apriori algorithm, which suffers from a large number of frequently overlapping items, so the dimensionality of the feature vector representing a message sequence becomes very large. In 2013, Wang Yipeng et al. pioneered introducing the N-gram model and the latent Dirichlet allocation (LDA) model from natural language processing into protocol sequence clustering, determining the optimal value of N using Zipf's law and then modeling with LDA. This method ignores the varying lengths of protocol message keywords and does not consider the semantic association between preceding and following words when building the message embedding representation, so it cannot accurately measure the similarity between message sequences and its clustering effect is poor.
Disclosure of Invention
The invention aims to provide, in view of the above defects of the prior art, a message sequence clustering method for an unknown binary private protocol that fully mines the latent semantic information of messages during private-protocol feature extraction and thereby improves clustering accuracy.
The technical scheme of the invention is as follows: model the message sequence with an N-gram language model, extract multi-scale N-gram features without fixing the value of N, and train the word-vector embedding representation of the message sequence with a word2vec model. The implementation steps include:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping link layer and transmission layer data of an unknown binary private protocol message sequence by a network packet analysis technology to obtain application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum value and the maximum value range of the N value;
(3b) taking N values in this range and segmenting the sample data set with the N-gram model to obtain the segmented word vector of each message sequence as the multi-scale N-gram feature of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) encoding the message sequences with One-Hot encoding according to their word vectors to obtain the feature vector space model of the encoded message sequences;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) building the embedded representation of the message sequence according to the feature vector vocabulary:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary, i.e. the hidden-layer weight matrix of the shallow neural network;
(5c) for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of the message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
(6) performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing the message sequence data divided into each set separately.
Compared with the prior art, the invention has the following advantages:
firstly, the invention carries out multi-scale N-gram feature extraction under the condition of not fixing the N value, and overcomes the problem of embedding representation of unequal lengths of keywords of message sequences.
Secondly, the word table is embedded and expressed by a word2vec model, and semantic association characteristics of front and rear words are combined on the aspect of determining the weight of the keywords, so that the method can fully mine potential semantic information in the message, and further improve the accuracy of clustering.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The specific steps of the present invention are further described below with reference to fig. 1.
Step 1, collecting an unknown binary private protocol message sequence with a data collection method.
(1a) setting the network card of the server acquisition device to promiscuous mode so that it can monitor the wireless communication data, then starting communication entities A and B to establish a communication connection;
(1b) intercepting the message sequence communication data between communication entities A and B with the Wireshark software and saving it as a pcap file to obtain an unknown binary private protocol message sequence, where the message sequence contains link layer data, transport layer data, and application layer data.
Step 2, preprocessing the acquired unknown binary private protocol message sequence.
(2a) Analyzing an intercepted unknown binary private protocol message sequence according to the structure of a network data packet, namely stripping link layer data and transmission layer data contained in the message sequence to obtain application layer data of the message sequence, wherein the application layer data is in a binary format;
(2b) converting the application layer binary message sequence data into hexadecimal message sequence data according to the radix conversion rule, for example binary 1111 corresponds to hexadecimal F;
(2c) labeling the hexadecimal message sequence data to generate the sample data set.
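What follows is a minimal Python sketch of this preprocessing step, assuming the scapy library and a hypothetical capture file "capture.pcap"; the patent specifies only the behavior (strip the link and transport layers, convert the application payload to hexadecimal), not a particular tool.

```python
from scapy.all import rdpcap, Raw, TCP, UDP

def pcap_to_hex_sequences(path):
    """Strip link- and transport-layer headers and return each
    application-layer payload as a lowercase hex string."""
    sequences = []
    for pkt in rdpcap(path):
        # Raw holds whatever sits above TCP/UDP, i.e. the application payload.
        if (pkt.haslayer(TCP) or pkt.haslayer(UDP)) and pkt.haslayer(Raw):
            sequences.append(bytes(pkt[Raw].load).hex())  # binary -> hexadecimal
    return sequences

hex_sequences = pcap_to_hex_sequences("capture.pcap")  # hypothetical file name
```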
Step 3, extracting the multi-scale N-gram features of the sample data set.
The N-gram model is a natural language processing model based on string statistics. It rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, so the probability of the entire sequence equals the product of the conditional probabilities of the individual words. Assuming the sequence T consists of the words ω_1, ω_2, ..., ω_n, the probability of occurrence of T is:
P(T) = p(ω_1) × p(ω_2|ω_1) × ... × p(ω_n|ω_1 ω_2 ... ω_(n-1)),
where P(T) is the probability of occurrence of the sequence T and p(ω_i) is the probability of occurrence of the word ω_i.
The choice of the N value in the N-gram model is critical: a larger N ensures the integrity of the segmented data but reduces effectiveness, while too small an N cannot capture complete lexical information during word segmentation. The invention therefore extends the fixed N value to a range over which N takes multiple values.
The specific implementation of extracting the multi-scale N-gram features of the sample data set in this step is as follows:
(3a) determine the range of N, i.e. its minimum and maximum values, generally set to 2-5;
(3b) for each N in the range, segment the sample data set with the N-gram model to obtain the segmented word vectors of each message sequence; for example, when N is 2, the message sequence "020a" is segmented into the word vector "02 20 0a";
(3c) combine the word vectors obtained under the different N values into the segmented word vector of the message sequence, which serves as the multi-scale N-gram feature of the sample data set.
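As an illustration, here is a minimal Python sketch of the multi-scale N-gram segmentation of (3a)-(3c); the sliding window runs over hex characters, matching the "020a" → "02 20 0a" example for N=2.

```python
def multiscale_ngrams(message, n_min=2, n_max=5):
    """Word vector of a message: all N-grams for every N in [n_min, n_max]."""
    words = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the hex string.
        words.extend(message[i:i + n] for i in range(len(message) - n + 1))
    return words

print(multiscale_ngrams("020a", 2, 2))  # ['02', '20', '0a']
```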
Step 4, reducing the dimension of the multi-scale N-gram features based on variance selection.
Feature selection methods can be divided into three types according to their form: (1) filtering methods, which score each feature by divergence or correlation, set a score threshold or a number of features to select, and select features accordingly; (2) wrapper methods, which select or exclude several features at a time according to an objective function; (3) embedded methods, which first train certain machine learning algorithms and models to obtain weight coefficients for all features and then select features by coefficient magnitude.
The invention uses, but is not limited to, variance-based selection from the filtering methods to reduce the feature dimensionality. The specific implementation is as follows:
(4a) encoding the message sequence word vectors according to the One-Hot encoding rule. One-Hot encoding uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one bit is valid. For example, for the feature vector vocabulary "a0 b0 c0 00 05 ...", the feature "a0" may be encoded as [1 0 0 0 0 ...] and the feature "b0" as [0 1 0 0 0 ...], so a message sequence whose word vector contains "a0", "00", and "05" is represented by the embedded vector [1 0 0 1 1 ...];
(4b) combining the encoded vectors of all message sequences to obtain the feature vector space model;
(4c) calculating the variance distribution of each feature vector according to the feature vector space model:
σ² = (1/N) Σ (x − u)²,
where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set;
(4d) setting a score threshold for the variance selection and selecting the feature vectors whose variance is larger than the threshold as the feature vector vocabulary of the sample data set.
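The variance-based selection of step 4 can be sketched in Python as follows (numpy only); treating the threshold parameter from the simulation section as a plain variance threshold is an assumption, since the patent does not state exactly how the parameter is applied.

```python
import numpy as np

def variance_select(messages_words, threshold):
    """One-Hot encode message word vectors and keep high-variance features."""
    vocab = sorted({w for words in messages_words for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    # Feature vector space model: X[i, j] = 1 if word j occurs in message i.
    X = np.zeros((len(messages_words), len(vocab)))
    for i, words in enumerate(messages_words):
        for w in words:
            X[i, index[w]] = 1.0
    variances = X.var(axis=0)  # sigma^2 = (1/N) * sum((x - u)^2) per feature
    return [w for w, v in zip(vocab, variances) if v > threshold]
```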
Step 5, training the word vector features of the message sequence with the word2vec model according to the feature vector vocabulary to obtain the word embedded vector dictionary wv.
The word2vec model is a shallow neural network model for generating word vectors: by maximizing the likelihood of words at adjacent positions given an input word, it maps each word to an embedded vector that represents the semantic relations between word pairs, and this vector is the hidden layer of the shallow neural network.
The specific implementation of this step is as follows:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words that appear in the vocabulary as the word vector features of the message sequence. For example, with the feature vector vocabulary "a0 b0 c0 00 05 ..." and a message sequence word vector "a0 00 05 5e e0 0c c0 04 45 ...", the screened message sequence word vector is "a0 c0 a0 00 05";
(5b) taking the word vector characteristics of the sample training set as input, training by using a word2vec model to obtain a weight matrix of a shallow neural network hidden layer, and taking the weight matrix as an embedded vector dictionary wv of words.
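A minimal sketch of (5b), assuming gensim's Word2Vec as a concrete word2vec implementation (the patent does not name a toolkit) and a hypothetical screened input; the trained hidden-layer weights are exposed as the dictionary wv.

```python
from gensim.models import Word2Vec

# screened: one list of vocabulary words per message, e.g. from step (5a)
screened = [["a0", "c0", "a0", "00", "05"], ["a0", "18", "00", "05"]]
model = Word2Vec(sentences=screened, vector_size=64, window=5,
                 min_count=1, sg=1)  # hidden-layer weights are the embeddings
wv = model.wv                        # the embedded vector dictionary wv
print(wv["a0"].shape)                # (64,)
```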
Step 6, building the embedded representation of the message sequence.
(6a) for each word w in each message sequence, looking up its embedded vector wv[w] in the embedded vector dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence:
E_v = (1/M) Σ_w wv[w],
where E_v is the embedded vector representation of the message sequence, wv[w] is the embedded vector of each word, and M is the number of words contained in the message sequence;
(6b) normalizing the embedded vector E_v of each message sequence into a unit vector and combining them to obtain the embedded vector matrix Ê of the message sequences.
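Step 6 can be sketched as follows: average the embeddings wv[w] of the words in each message, normalize to unit length, and stack the rows into the matrix Ê (returned here as a numpy array).

```python
import numpy as np

def embed_messages(messages_words, wv):
    """Average word embeddings per message, normalize, stack into matrix E."""
    rows = []
    for words in messages_words:
        Ev = np.mean([wv[w] for w in words], axis=0)  # Ev = (1/M) * sum wv[w]
        rows.append(Ev / np.linalg.norm(Ev))          # unit-length row
    return np.vstack(rows)                            # one row per message
```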
Step 7, performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K.
(7a) randomly selecting a point of the embedded vector matrix Ê as the starting point s;
(7b) setting a search radius h and computing the offsets required to move the point s to each point x_i within the radius h, then summing and averaging them to obtain the mean offset M_h(s) = (1/k) Σ_{x_i ∈ S_h} (x_i − s), where S_h is the set of the k points within radius h of s;
(7c) moving the point s along the direction of the mean offset M_h(s) to a new point s', the length of the move being the modulus ||M_h(s)|| of the mean offset;
(7d) taking the moved point s' as the new starting point and repeating (7b)-(7c) until the mean offset is smaller than a set threshold or the iteration limit is reached; the resulting points s' are the cluster centers, and the number of cluster centers is the optimal cluster number K.
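A sketch of step 7, assuming scikit-learn's MeanShift as the mode point search; the search radius h corresponds to the bandwidth parameter, and the number of discovered modes is taken as K.

```python
from sklearn.cluster import MeanShift

def optimal_k(E, bandwidth=None):
    """Mode point search: the number of MeanShift modes is the cluster count K."""
    ms = MeanShift(bandwidth=bandwidth)  # bandwidth=None lets sklearn estimate h
    ms.fit(E)
    return len(ms.cluster_centers_)
```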
Step 8, clustering the message sequences with the K-Means clustering method.
K-Means partitions the sample data into K clusters so that each point belongs to the class of its nearest mean, i.e. its nearest cluster center. The specific implementation is as follows:
(8a) randomly selecting K points of the embedded vector matrix Ê as the initial cluster centers {u_1, u_2, ..., u_K};
(8b) computing the Euclidean distance d = ||x_i − u_j||_2 between every point x_i of the embedded vector matrix Ê and each cluster center u_j, labeling x_i with the class λ_i whose center has the minimum distance d, and updating the corresponding cluster C_{λ_i} = C_{λ_i} ∪ {x_i};
(8c) updating the cluster centers u_j', taking the mean of all points in each cluster as the new center of that cluster: u_j' = (1/|C_j|) Σ_{x ∈ C_j} x;
(8d) taking u_j' as the new cluster centers, repeating (8b)-(8c), and iterating until the cluster centers no longer change or the iteration limit is reached;
(8e) storing the message sequence data of each cluster separately.
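Step 8 can be sketched with scikit-learn's KMeans; messages are grouped by their assigned label so that each set can be stored separately, as in (8e).

```python
from collections import defaultdict
from sklearn.cluster import KMeans

def cluster_messages(E, hex_sequences, K):
    """Partition messages into K sets and group the raw sequences per cluster."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(E)
    clusters = defaultdict(list)
    for seq, label in zip(hex_sequences, labels):
        clusters[label].append(seq)
    return clusters  # cluster id -> list of message sequences
```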
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation experiment conditions are as follows:
In the simulation experiment, the two communication entities are a Dobby pocket drone from ZeroTech and its ground station; the communication data intercepted between them serves as the unknown binary private protocol message sequence data of the simulation experiment.
The protocol message sequences were clustered and analyzed on a server with an Intel(R) Core(TM) i5-8250U CPU, a Windows 10 operating system, and 16 GB of memory, on which the simulation results were evaluated.
2. Simulation experiment content and result analysis:
the simulation experiment of the invention is to collect the protocol message sequence data of both communication entities, and to extract the characteristics and perform cluster analysis on the server. The simulation experiment comprises the following specific steps:
step 1, collecting protocol message sequence data of both communication entities, selecting 2000 protocol message sequences for data annotation;
step 2, preprocessing the protocol message sequence data to generate application layer hexadecimal message sequence data;
step 3, setting the maximum value N_max of the N value range;
step 4, segmenting and combining the message sequences according to the set N value range to obtain the word vectors of the message sequences;
step 5, performing one-hot encoding on the message sequence word vectors, computing the variance distribution of the feature vectors, and selecting the features whose variance exceeds the threshold (the threshold parameter was set to 0.8 in this simulation) as the feature vector vocabulary;
step 6, screening the word vectors of the message sequences against the vocabulary, taking them as input, and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary;
step 7, for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence;
step 8, normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
step 9, determining the optimal cluster number K of the message sequences by MeanShift mode point search;
step 10, clustering the message sequences with the K-Means method according to the optimal cluster number K to obtain the class division of the message sequences and storing the result, which is shown in Table 1.
Table 1: summary of simulation test results
          Number of messages   Ratio     Representative format
Class 1   1484                 74.20%    a018*
Class 2   295                  14.75%    a010*
Class 3   147                  7.35%     a00f*
Class 4   74                   3.70%     a005*
As can be seen from Table 1, the simulation experiment divides the 2000 collected protocol message sequences into 4 classes, whose representative formats are "a018*", "a010*", "a00f*", and "a005*" respectively, consistent with the manual labels.
3. Accuracy analysis of simulation experiment:
To demonstrate the effectiveness of the cluster analysis of the invention, the clustering accuracy under different N value range thresholds N_max was calculated as
Accuracy = (number of correctly clustered message sequences) / (total number of message sequences),
and the results are shown in Table 2.
Table 2: accuracy list of message sequence clustering analysis
N_max      2       3       4       5       6       7       8       9
Accuracy   0.5383  0.7189  0.7247  0.9940  0.5301  0.9658  0.6019  0.6450
As can be seen from Table 2, when N_max is 5 the accuracy is highest, reaching 99.40%. This result verifies the effectiveness of the invention and shows that the method can serve as an effective clustering method for unknown binary private protocol message sequences.

Claims (7)

1. A message sequence clustering method of an unknown binary private protocol is characterized by comprising the following steps:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping link layer and transmission layer data of an unknown binary private protocol message sequence by a network packet analysis technology to obtain application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum value and the maximum value range of the N value;
(3b) taking N value in the range, and segmenting the sample data set by using an N-gram model to obtain segmented message sequence word vectors serving as multi-scale N-gram characteristics of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) encoding the message sequences with One-Hot encoding according to their word vectors to obtain the feature vector space model of the encoded message sequences;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) building the embedded representation of the message sequence according to the feature vector vocabulary:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary, i.e. the hidden-layer weight matrix of the shallow neural network;
(5c) for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of the message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
(6) performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing the message sequence data divided into each set separately.
2. The method according to claim 1, wherein the binary private protocol packet sequence in (1) comprises link layer data, transport layer data, and application layer data.
3. The method of claim 1, wherein the One-Hot encoding in (4a) is performed using an N-bit status register to encode N states, each state having its own independent register bit and only One of which is active at any time.
4. The method of claim 1, wherein the variance distribution of each feature vector in (4b) is calculated by the formula
σ² = (1/N) Σ (x − u)²,
where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set.
5. The method of claim 1, wherein the word2vec model in (5b) is a shallow neural network model for generating word vectors: by maximizing the likelihood of words at adjacent positions given an input word, it maps each word to an embedded vector that represents the semantic relations between word pairs, and this vector is the hidden layer of the shallow neural network.
6. The method of claim 1, wherein said (6) performs mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method, implemented as follows:
(6a) randomly selecting a point of the embedded vector matrix Ê as the starting point s;
(6b) setting a search radius h and computing the offsets required to move the point s to each point x_i within the radius h, then summing and averaging them to obtain the mean offset M_h(s) = (1/k) Σ_{x_i ∈ S_h} (x_i − s), where S_h is the set of the k points within radius h of s;
(6c) moving the point s along the direction of the mean offset M_h(s) to a new point s', the length of the move being the modulus ||M_h(s)|| of the mean offset;
(6d) taking the moved point s' as the new starting point and repeating (6b)-(6c) until the mean offset is smaller than a set threshold or the iteration limit is reached; the resulting points s' are the cluster centers, and the number of cluster centers is the optimal cluster number K.
7. The method according to claim 1, wherein said (7a) divides the message sequences into K sets using the K-Means clustering method, as follows:
(8a) randomly selecting K points of the embedded vector matrix Ê as the initial cluster centers {u_1, u_2, ..., u_K};
(8b) computing the Euclidean distance d = ||x_i − u_j||_2 between every point x_i of the embedded vector matrix Ê and each cluster center u_j, labeling x_i with the class λ_i whose center has the minimum distance d, and updating the corresponding cluster C_{λ_i} = C_{λ_i} ∪ {x_i};
(8c) updating the cluster centers u_j', taking the mean of all points in each cluster as the new center of that cluster: u_j' = (1/|C_j|) Σ_{x ∈ C_j} x;
(8d) taking u_j' as the new cluster centers, repeating (8b)-(8c), and iterating until the cluster centers no longer change or the iteration limit is reached.
CN201910173504.6A 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol Active CN109951464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910173504.6A CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910173504.6A CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Publications (2)

Publication Number Publication Date
CN109951464A CN109951464A (en) 2019-06-28
CN109951464B true CN109951464B (en) 2021-05-14

Family

ID=67008531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910173504.6A Active CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Country Status (1)

Country Link
CN (1) CN109951464B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110602073B (en) * 2019-09-02 2021-05-18 西安电子科技大学 Unmanned aerial vehicle flight control protocol field division method based on information theory
CN112367325B (en) * 2020-11-13 2023-11-07 中国人民解放军陆军工程大学 Unknown protocol message clustering method and system based on closed frequent item mining
CN112398865B (en) * 2020-11-20 2022-11-08 苏州攀秉科技有限公司 Application layer information reasoning method under multilayer protocol nesting condition
CN114724069B (en) * 2022-04-09 2023-04-07 北京天防安全科技有限公司 Video equipment model confirming method, device, equipment and medium
CN115334179B (en) * 2022-07-19 2023-09-01 四川大学 Unknown protocol reverse analysis method based on named entity recognition
CN116016690A (en) * 2022-12-02 2023-04-25 国家工业信息安全发展研究中心 Automatic reverse analysis method and system for industrial private protocol

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108280357A (en) * 2018-01-31 2018-07-13 云易天成(北京)安全科技开发有限公司 Data leakage prevention method, system based on semantic feature extraction
CN109165383A (en) * 2018-08-09 2019-01-08 四川政资汇智能科技有限公司 A kind of data convergence, analysis, excavation and sharing method based on cloud platform

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170024455A1 (en) * 2015-07-24 2017-01-26 Facebook, Inc. Expanding mutually exclusive clusters of users of an online system clustered based on a specified dimension
US11093711B2 (en) * 2016-09-28 2021-08-17 Microsoft Technology Licensing, Llc Entity-specific conversational artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108280357A (en) * 2018-01-31 2018-07-13 云易天成(北京)安全科技开发有限公司 Data leakage prevention method, system based on semantic feature extraction
CN109165383A (en) * 2018-08-09 2019-01-08 四川政资汇智能科技有限公司 A kind of data convergence, analysis, excavation and sharing method based on cloud platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text coherence new method using word2vec sentence vectors and most likely n-grams; Mohamad Abdolahi Kharazmi; 2017 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS); 2018-03-12; full text *
Research on Text Sentiment Classification Based on Language Model and Machine Learning; Zhang Qi; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2018-04-15 (No. 04); full text *

Also Published As

Publication number Publication date
CN109951464A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109951464B (en) Message sequence clustering method for unknown binary private protocol
US20230031738A1 (en) Taxpayer industry classification method based on label-noise learning
US10963685B2 (en) Generating variations of a known shred
CN107862046A (en) A kind of tax commodity code sorting technique and system based on short text similarity
CN115098620B (en) Cross-modal hash retrieval method for attention similarity migration
CN111274804A (en) Case information extraction method based on named entity recognition
CN112926045B (en) Group control equipment identification method based on logistic regression model
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN112800249A (en) Fine-grained cross-media retrieval method based on generation of countermeasure network
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN111158641A (en) Affair function point automatic identification method based on semantic analysis and text mining, corresponding storage medium and electronic device
CN110347827B (en) Event Extraction Method for Heterogeneous Text Operation and Maintenance Data
CN108519978A (en) A kind of Chinese document segmenting method based on Active Learning
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN111159377B (en) Attribute recall model training method, attribute recall model training device, electronic equipment and storage medium
CN112446205A (en) Sentence distinguishing method, device, equipment and storage medium
CN116842949A (en) Event extraction method, device, electronic equipment and storage medium
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN111881678B (en) Domain word discovery method based on unsupervised learning
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN103744830A (en) Semantic analysis based identification method of identity information in EXCEL document
CN113222059A (en) Multi-label emotion classification method using cooperative neural network chain
CN102722489B (en) The system and method for extracting object identifier from webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230620

Address after: Floor 9, Building 3, Yougu, No. 12, Mozhou East Road, Moling Street, Jiangning District, Nanjing, Jiangsu 211111

Patentee after: NANJING CYBER PEACE INFORMATION TECHNOLOGY CO.,LTD.

Address before: 710071 No. 2 Taibai South Road, Shaanxi, Xi'an

Patentee before: XIDIAN University

TR01 Transfer of patent right