CN109951464B - Message sequence clustering method for unknown binary private protocol - Google Patents

Message sequence clustering method for unknown binary private protocol

Info

Publication number
CN109951464B
Authority
CN
China
Prior art keywords
message sequence
vector
clustering
word
message
Prior art date
Legal status
Active
Application number
CN201910173504.6A
Other languages
Chinese (zh)
Other versions
CN109951464A (en)
Inventor
杨超
吴继超
Current Assignee
Nanjing Cyber Peace Technology Co Ltd
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910173504.6A
Publication of CN109951464A
Application granted
Publication of CN109951464B
Legal status: Active
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a message sequence clustering method for an unknown binary private protocol, which mainly solves the prior-art problem that the similarity between protocol message sequences cannot be accurately measured during protocol reverse engineering. The implementation scheme is as follows: 1) collect an unknown binary private protocol message sequence; 2) preprocess the collected message sequence; 3) extract multi-scale N-gram features of the preprocessed message sequence; 4) reduce the dimension of the multi-scale N-gram features by variance-based selection; 5) build the embedded representation of the message sequence from the dimension-reduced multi-scale N-gram features; 6) determine the optimal cluster number K from the embedded representation of the message sequences; 7) cluster the message sequences according to the optimal cluster number K. The invention fully mines the latent semantic information of message sequences, can accurately measure the similarity between message sequences, improves clustering accuracy, and can be used for clustering unknown binary private protocols.

Description

Message sequence clustering method for unknown binary private protocol
Technical Field
The invention belongs to the technical field of information, and further relates to a message sequence clustering method which can be used for clustering unknown binary private protocols.
Background
Network protocols are specifications for the communication of entities in a network: they specify the data format and the associated synchronization when communicating entities exchange information with one another. In addition to standardized communication protocols, networks carry a large number of unknown proprietary protocols. Message sequence clustering is the first task in the protocol reverse-engineering process: it separates the messages of mixed private-protocol message sequences into their types as far as possible according to the similarity between message sequences, after which field format inference and state machine inference are performed.
The core problem of private protocol message sequence clustering, i.e. network protocol identification, is how to accurately measure the similarity between message sequences. Current message sequence clustering algorithms for unknown private protocols can be roughly divided into three categories: edit-distance-based, keyword-based, and probability-model-based sequence clustering. Edit distance measures the similarity between sequences by the minimum number of operations, insertion, deletion, and substitution of a character, required to change one string into another. The edit distance algorithm is similar in spirit to the longest-common-subsequence idea of the Needleman-Wunsch algorithm; viewed as text matching, both ignore local features between sequences, yet these local features, namely protocol keywords, may be the key to measuring inter-sequence similarity in protocol clustering. Probability-model-based sequence clustering is difficult to model and is effective only for clustering long sequences. The representative keyword-based sequence clustering algorithm is the Apriori algorithm, which suffers from a large number of frequently overlapping items, so the dimensionality of the feature vector representing a message sequence becomes very large. In 2013, Wang Yipeng et al. pioneered introducing the N-gram model and the latent Dirichlet allocation (LDA) model from natural language processing into protocol sequence clustering, determining the optimal value of N using Zipf's law and then modeling with LDA. This method ignores the varying lengths of protocol message keywords and does not consider the semantic association between preceding and following words when building the message embedding representation, so it cannot accurately measure the similarity between message sequences and its clustering effect is poor.
Disclosure of Invention
The invention aims to provide, in view of the above defects of the prior art, a message sequence clustering method for an unknown binary private protocol that fully mines the latent semantic information of messages during private-protocol feature extraction and thereby improves clustering accuracy.
The technical scheme of the invention is as follows: model the message sequence with an N-gram language model, extract multi-scale N-gram features without fixing the value of N, and train the word-vector embedding representation of the message sequence with a word2vec model. The implementation steps include:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping link layer and transmission layer data of an unknown binary private protocol message sequence by a network packet analysis technology to obtain application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum value and the maximum value range of the N value;
(3b) taking N values in this range and segmenting the sample data set with the N-gram model to obtain the segmented word vector of each message sequence as the multi-scale N-gram feature of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) encoding the message sequences with One-Hot encoding according to their word vectors to obtain the feature vector space model of the encoded message sequences;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) building the embedded representation of the message sequence according to the feature vector vocabulary:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary, i.e. the hidden-layer weight matrix of the shallow neural network;
(5c) for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of the message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
(6) performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing the message sequence data divided into each set separately.
Compared with the prior art, the invention has the following advantages:
firstly, the invention carries out multi-scale N-gram feature extraction under the condition of not fixing the N value, and overcomes the problem of embedding representation of unequal lengths of keywords of message sequences.
Secondly, the word table is embedded and expressed by a word2vec model, and semantic association characteristics of front and rear words are combined on the aspect of determining the weight of the keywords, so that the method can fully mine potential semantic information in the message, and further improve the accuracy of clustering.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The specific steps of the present invention are further described below with reference to fig. 1.
Step 1, collecting an unknown binary private protocol message sequence with a data collection method.
(1a) setting the network card of the server acquisition device to promiscuous mode so that it can monitor the wireless communication data, then starting communication entities A and B to establish a communication connection;
(1b) intercepting the message sequence communication data between communication entities A and B with the Wireshark software and saving it as a pcap file to obtain an unknown binary private protocol message sequence, where the message sequence contains link layer data, transport layer data, and application layer data.
Step 2, preprocessing the acquired unknown binary private protocol message sequence.
(2a) Analyzing an intercepted unknown binary private protocol message sequence according to the structure of a network data packet, namely stripping link layer data and transmission layer data contained in the message sequence to obtain application layer data of the message sequence, wherein the application layer data is in a binary format;
(2b) converting the application layer binary message sequence data into hexadecimal message sequence data according to the radix conversion rule, for example binary 1111 corresponds to hexadecimal F;
(2c) labeling the hexadecimal message sequence data to generate the sample data set.
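What follows is a minimal Python sketch of this preprocessing step, assuming the scapy library and a hypothetical capture file "capture.pcap"; the patent specifies only the behavior (strip the link and transport layers, convert the application payload to hexadecimal), not a particular tool.

```python
from scapy.all import rdpcap, Raw, TCP, UDP

def pcap_to_hex_sequences(path):
    """Strip link- and transport-layer headers and return each
    application-layer payload as a lowercase hex string."""
    sequences = []
    for pkt in rdpcap(path):
        # Raw holds whatever sits above TCP/UDP, i.e. the application payload.
        if (pkt.haslayer(TCP) or pkt.haslayer(UDP)) and pkt.haslayer(Raw):
            sequences.append(bytes(pkt[Raw].load).hex())  # binary -> hexadecimal
    return sequences

hex_sequences = pcap_to_hex_sequences("capture.pcap")  # hypothetical file name
```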
Step 3, extracting the multi-scale N-gram features of the sample data set.
The N-gram model is a natural language processing model based on string statistics. It rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words, so the probability of the entire sequence equals the product of the conditional probabilities of the individual words. Assuming the sequence T consists of the words ω_1, ω_2, ..., ω_n, the probability of occurrence of T is:
P(T) = p(ω_1) × p(ω_2|ω_1) × ... × p(ω_n|ω_1 ω_2 ... ω_(n-1)),
where P(T) is the probability of occurrence of the sequence T and p(ω_i) is the probability of occurrence of the word ω_i.
The choice of the N value in the N-gram model is critical: a larger N ensures the integrity of the segmented data but reduces effectiveness, while too small an N cannot capture complete lexical information during word segmentation. The invention therefore extends the fixed N value to a range over which N takes multiple values.
The specific implementation of extracting the multi-scale N-gram features of the sample data set in this step is as follows:
(3a) determine the range of N, i.e. its minimum and maximum values, generally set to 2-5;
(3b) for each N in the range, segment the sample data set with the N-gram model to obtain the segmented word vectors of each message sequence; for example, when N is 2, the message sequence "020a" is segmented into the word vector "02 20 0a";
(3c) combine the word vectors obtained under the different N values into the segmented word vector of the message sequence, which serves as the multi-scale N-gram feature of the sample data set.
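As an illustration, here is a minimal Python sketch of the multi-scale N-gram segmentation of (3a)-(3c); the sliding window runs over hex characters, matching the "020a" → "02 20 0a" example for N=2.

```python
def multiscale_ngrams(message, n_min=2, n_max=5):
    """Word vector of a message: all N-grams for every N in [n_min, n_max]."""
    words = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the hex string.
        words.extend(message[i:i + n] for i in range(len(message) - n + 1))
    return words

print(multiscale_ngrams("020a", 2, 2))  # ['02', '20', '0a']
```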
Step 4, reducing the dimension of the multi-scale N-gram features based on variance selection.
Feature selection methods can be divided into three types according to their form: (1) filtering methods, which score each feature by divergence or correlation, set a score threshold or a number of features to select, and select features accordingly; (2) wrapper methods, which select or exclude several features at a time according to an objective function; (3) embedded methods, which first train certain machine learning algorithms and models to obtain weight coefficients for all features and then select features by coefficient magnitude.
The invention uses, but is not limited to, variance-based selection from the filtering methods to reduce the feature dimensionality. The specific implementation is as follows:
(4a) encoding the message sequence word vectors according to the One-Hot encoding rule. One-Hot encoding uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one bit is valid. For example, for the feature vector vocabulary "a0 b0 c0 00 05 ...", the feature "a0" may be encoded as [1 0 0 0 0 ...] and the feature "b0" as [0 1 0 0 0 ...], so a message sequence whose word vector contains "a0", "00", and "05" is represented by the embedded vector [1 0 0 1 1 ...];
(4b) combining the encoded vectors of all message sequences to obtain the feature vector space model;
(4c) calculating the variance distribution of each feature vector according to the feature vector space model:
σ² = (1/N) Σ (x − u)²,
where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set;
(4d) setting a score threshold for the variance selection and selecting the feature vectors whose variance is larger than the threshold as the feature vector vocabulary of the sample data set.
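The variance-based selection of step 4 can be sketched in Python as follows (numpy only); treating the threshold parameter from the simulation section as a plain variance threshold is an assumption, since the patent does not state exactly how the parameter is applied.

```python
import numpy as np

def variance_select(messages_words, threshold):
    """One-Hot encode message word vectors and keep high-variance features."""
    vocab = sorted({w for words in messages_words for w in words})
    index = {w: j for j, w in enumerate(vocab)}
    # Feature vector space model: X[i, j] = 1 if word j occurs in message i.
    X = np.zeros((len(messages_words), len(vocab)))
    for i, words in enumerate(messages_words):
        for w in words:
            X[i, index[w]] = 1.0
    variances = X.var(axis=0)  # sigma^2 = (1/N) * sum((x - u)^2) per feature
    return [w for w, v in zip(vocab, variances) if v > threshold]
```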
Step 5, training the word vector features of the message sequence with the word2vec model according to the feature vector vocabulary to obtain the word embedded vector dictionary wv.
The word2vec model is a shallow neural network model for generating word vectors: by maximizing the likelihood of words at adjacent positions given an input word, it maps each word to an embedded vector that represents the semantic relations between word pairs, and this vector is the hidden layer of the shallow neural network.
The specific implementation of this step is as follows:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words that appear in the vocabulary as the word vector features of the message sequence. For example, with the feature vector vocabulary "a0 b0 c0 00 05 ..." and a message sequence word vector "a0 00 05 5e e0 0c c0 04 45 ...", the screened message sequence word vector is "a0 c0 a0 00 05";
(5b) taking the word vector characteristics of the sample training set as input, training by using a word2vec model to obtain a weight matrix of a shallow neural network hidden layer, and taking the weight matrix as an embedded vector dictionary wv of words.
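A minimal sketch of (5b), assuming gensim's Word2Vec as a concrete word2vec implementation (the patent does not name a toolkit) and a hypothetical screened input; the trained hidden-layer weights are exposed as the dictionary wv.

```python
from gensim.models import Word2Vec

# screened: one list of vocabulary words per message, e.g. from step (5a)
screened = [["a0", "c0", "a0", "00", "05"], ["a0", "18", "00", "05"]]
model = Word2Vec(sentences=screened, vector_size=64, window=5,
                 min_count=1, sg=1)  # hidden-layer weights are the embeddings
wv = model.wv                        # the embedded vector dictionary wv
print(wv["a0"].shape)                # (64,)
```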
Step 6, building the embedded representation of the message sequence.
(6a) for each word w in each message sequence, looking up its embedded vector wv[w] in the embedded vector dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence:
E_v = (1/M) Σ_w wv[w],
where E_v is the embedded vector representation of the message sequence, wv[w] is the embedded vector of each word, and M is the number of words contained in the message sequence;
(6b) normalizing the embedded vector E_v of each message sequence into a unit vector and combining them to obtain the embedded vector matrix Ê of the message sequences.
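Step 6 can be sketched as follows: average the embeddings wv[w] of the words in each message, normalize to unit length, and stack the rows into the matrix Ê (returned here as a numpy array).

```python
import numpy as np

def embed_messages(messages_words, wv):
    """Average word embeddings per message, normalize, stack into matrix E."""
    rows = []
    for words in messages_words:
        Ev = np.mean([wv[w] for w in words], axis=0)  # Ev = (1/M) * sum wv[w]
        rows.append(Ev / np.linalg.norm(Ev))          # unit-length row
    return np.vstack(rows)                            # one row per message
```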
Step 7, performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K.
(7a) randomly selecting a point of the embedded vector matrix Ê as the starting point s;
(7b) setting a search radius h and computing the offsets required to move the point s to each point x_i within the radius h, then summing and averaging them to obtain the mean offset M_h(s) = (1/k) Σ_{x_i ∈ S_h} (x_i − s), where S_h is the set of the k points within radius h of s;
(7c) moving the point s along the direction of the mean offset M_h(s) to a new point s', the length of the move being the modulus ||M_h(s)|| of the mean offset;
(7d) taking the moved point s' as the new starting point and repeating (7b)-(7c) until the mean offset is smaller than a set threshold or the iteration limit is reached; the resulting points s' are the cluster centers, and the number of cluster centers is the optimal cluster number K.
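A sketch of step 7, assuming scikit-learn's MeanShift as the mode point search; the search radius h corresponds to the bandwidth parameter, and the number of discovered modes is taken as K.

```python
from sklearn.cluster import MeanShift

def optimal_k(E, bandwidth=None):
    """Mode point search: the number of MeanShift modes is the cluster count K."""
    ms = MeanShift(bandwidth=bandwidth)  # bandwidth=None lets sklearn estimate h
    ms.fit(E)
    return len(ms.cluster_centers_)
```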
Step 8, clustering the message sequences with the K-Means clustering method.
K-Means partitions the sample data into K clusters so that each point belongs to the class of its nearest mean, i.e. its nearest cluster center. The specific implementation is as follows:
(8a) randomly selecting K points of the embedded vector matrix Ê as the initial cluster centers {u_1, u_2, ..., u_K};
(8b) computing the Euclidean distance d = ||x_i − u_j||_2 between every point x_i of the embedded vector matrix Ê and each cluster center u_j, labeling x_i with the class λ_i whose center has the minimum distance d, and updating the corresponding cluster C_{λ_i} = C_{λ_i} ∪ {x_i};
(8c) updating the cluster centers u_j', taking the mean of all points in each cluster as the new center of that cluster: u_j' = (1/|C_j|) Σ_{x ∈ C_j} x;
(8d) taking u_j' as the new cluster centers, repeating (8b)-(8c), and iterating until the cluster centers no longer change or the iteration limit is reached;
(8e) storing the message sequence data of each cluster separately.
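Step 8 can be sketched with scikit-learn's KMeans; messages are grouped by their assigned label so that each set can be stored separately, as in (8e).

```python
from collections import defaultdict
from sklearn.cluster import KMeans

def cluster_messages(E, hex_sequences, K):
    """Partition messages into K sets and group the raw sequences per cluster."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(E)
    clusters = defaultdict(list)
    for seq, label in zip(hex_sequences, labels):
        clusters[label].append(seq)
    return clusters  # cluster id -> list of message sequences
```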
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation experiment conditions are as follows:
In the simulation experiment, the two communication entities are a Dobby pocket drone from ZeroTech and its ground station; the communication data intercepted between them serves as the unknown binary private protocol message sequence data of the simulation experiment.
The protocol message sequences were clustered and analyzed on a server with an Intel(R) Core(TM) i5-8250U CPU, a Windows 10 operating system, and 16 GB of memory, on which the simulation results were evaluated.
2. Simulation experiment content and result analysis:
the simulation experiment of the invention is to collect the protocol message sequence data of both communication entities, and to extract the characteristics and perform cluster analysis on the server. The simulation experiment comprises the following specific steps:
step 1, collecting protocol message sequence data of both communication entities, selecting 2000 protocol message sequences for data annotation;
step 2, preprocessing the protocol message sequence data to generate application layer hexadecimal message sequence data;
step 3, setting the maximum value N_max of the N value range;
step 4, segmenting and combining the message sequences according to the set N value range to obtain the word vectors of the message sequences;
step 5, performing one-hot encoding on the message sequence word vectors, computing the variance distribution of the feature vectors, and selecting the features whose variance exceeds the threshold (the threshold parameter was set to 0.8 in this simulation) as the feature vector vocabulary;
step 6, screening the word vectors of the message sequences against the vocabulary, taking them as input, and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary;
step 7, for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence;
step 8, normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
step 9, determining the optimal cluster number K of the message sequences by MeanShift mode point search;
step 10, clustering the message sequences with the K-Means method according to the optimal cluster number K to obtain the class division of the message sequences and storing the result, which is shown in Table 1.
Table 1: summary of simulation test results
          Number of messages   Ratio     Representative format
Class 1   1484                 74.20%    a018*
Class 2   295                  14.75%    a010*
Class 3   147                  7.35%     a00f*
Class 4   74                   3.70%     a005*
As can be seen from Table 1, the simulation experiment divides the 2000 collected protocol message sequences into 4 classes, whose representative formats are "a018*", "a010*", "a00f*", and "a005*" respectively, consistent with the manual labels.
3. Accuracy analysis of simulation experiment:
To demonstrate the effectiveness of the cluster analysis of the invention, the clustering accuracy under different N value range thresholds N_max was calculated as
Accuracy = (number of correctly clustered message sequences) / (total number of message sequences),
and the results are shown in Table 2.
Table 2: accuracy list of message sequence clustering analysis
N_max      2       3       4       5       6       7       8       9
Accuracy   0.5383  0.7189  0.7247  0.9940  0.5301  0.9658  0.6019  0.6450
As can be seen from Table 2, when N_max is 5 the accuracy is highest, reaching 99.40%. This result verifies the effectiveness of the invention and shows that the method can serve as an effective clustering method for unknown binary private protocol message sequences.

Claims (7)

1. A message sequence clustering method of an unknown binary private protocol is characterized by comprising the following steps:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping link layer and transmission layer data of an unknown binary private protocol message sequence by a network packet analysis technology to obtain application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum value and the maximum value range of the N value;
(3b) taking N value in the range, and segmenting the sample data set by using an N-gram model to obtain segmented message sequence word vectors serving as multi-scale N-gram characteristics of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) encoding the message sequences with One-Hot encoding according to their word vectors to obtain the feature vector space model of the encoded message sequences;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) building the embedded representation of the message sequence according to the feature vector vocabulary:
(5a) screening the word vectors of the message sequences with the feature vector vocabulary, keeping only the words in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the embedded vector dictionary wv of the vocabulary, i.e. the hidden-layer weight matrix of the shallow neural network;
(5c) for each word w in each message sequence, looking up its embedded vector wv[w] in the dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of the message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix Ê of the message sequences;
(6) performing mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing the message sequence data divided into each set separately.
2. The method according to claim 1, wherein the binary private protocol packet sequence in (1) comprises link layer data, transport layer data, and application layer data.
3. The method of claim 1, wherein the One-Hot encoding in (4a) is performed using an N-bit status register to encode N states, each state having its own independent register bit and only One of which is active at any time.
4. The method of claim 1, wherein the variance distribution of each feature vector in (4b) is calculated by the formula
σ² = (1/N) Σ (x − u)²,
where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set.
5. The method of claim 1, wherein the word2vec model in (5b) is a shallow neural network model for generating word vectors: by maximizing the likelihood of words at adjacent positions given an input word, it maps each word to an embedded vector that represents the semantic relations between word pairs, and this vector is the hidden layer of the shallow neural network.
6. The method of claim 1, wherein said (6) performs mode point search on the embedded vector matrix Ê of the message sequences with the MeanShift probability density estimation method, implemented as follows:
(6a) randomly selecting a point of the embedded vector matrix Ê as the starting point s;
(6b) setting a search radius h and computing the offsets required to move the point s to each point x_i within the radius h, then summing and averaging them to obtain the mean offset M_h(s) = (1/k) Σ_{x_i ∈ S_h} (x_i − s), where S_h is the set of the k points within radius h of s;
(6c) moving the point s along the direction of the mean offset M_h(s) to a new point s', the length of the move being the modulus ||M_h(s)|| of the mean offset;
(6d) taking the moved point s' as the new starting point and repeating (6b)-(6c) until the mean offset is smaller than a set threshold or the iteration limit is reached; the resulting points s' are the cluster centers, and the number of cluster centers is the optimal cluster number K.
7. The method according to claim 1, wherein said (7a) divides the message sequences into K sets using the K-Means clustering method, as follows:
(8a) randomly selecting K points of the embedded vector matrix Ê as the initial cluster centers {u_1, u_2, ..., u_K};
(8b) computing the Euclidean distance d = ||x_i − u_j||_2 between every point x_i of the embedded vector matrix Ê and each cluster center u_j, labeling x_i with the class λ_i whose center has the minimum distance d, and updating the corresponding cluster C_{λ_i} = C_{λ_i} ∪ {x_i};
(8c) updating the cluster centers u_j', taking the mean of all points in each cluster as the new center of that cluster: u_j' = (1/|C_j|) Σ_{x ∈ C_j} x;
(8d) taking u_j' as the new cluster centers, repeating (8b)-(8c), and iterating until the cluster centers no longer change or the iteration limit is reached.
CN201910173504.6A 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol Active CN109951464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910173504.6A CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910173504.6A CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Publications (2)

Publication Number Publication Date
CN109951464A CN109951464A (en) 2019-06-28
CN109951464B true CN109951464B (en) 2021-05-14

Family

ID=67008531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910173504.6A Active CN109951464B (en) 2019-03-07 2019-03-07 Message sequence clustering method for unknown binary private protocol

Country Status (1)

Country Link
CN (1) CN109951464B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110602073B (en) * 2019-09-02 2021-05-18 西安电子科技大学 Unmanned aerial vehicle flight control protocol field division method based on information theory
CN112367325B (en) * 2020-11-13 2023-11-07 中国人民解放军陆军工程大学 Unknown protocol message clustering method and system based on closed frequent item mining
CN112398865B (en) * 2020-11-20 2022-11-08 苏州攀秉科技有限公司 Application layer information reasoning method under multilayer protocol nesting condition
CN114724069B (en) * 2022-04-09 2023-04-07 北京天防安全科技有限公司 Video equipment model confirming method, device, equipment and medium
CN115334179B (en) * 2022-07-19 2023-09-01 四川大学 Unknown protocol reverse analysis method based on named entity recognition
CN116016690A (en) * 2022-12-02 2023-04-25 国家工业信息安全发展研究中心 Automatic reverse analysis method and system for industrial private protocol

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108280357A (en) * 2018-01-31 2018-07-13 云易天成(北京)安全科技开发有限公司 Data leakage prevention method, system based on semantic feature extraction
CN109165383A (en) * 2018-08-09 2019-01-08 四川政资汇智能科技有限公司 A kind of data convergence, analysis, excavation and sharing method based on cloud platform

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170024455A1 (en) * 2015-07-24 2017-01-26 Facebook, Inc. Expanding mutually exclusive clusters of users of an online system clustered based on a specified dimension
US11093711B2 (en) * 2016-09-28 2021-08-17 Microsoft Technology Licensing, Llc Entity-specific conversational artificial intelligence

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN108280357A (en) * 2018-01-31 2018-07-13 云易天成(北京)安全科技开发有限公司 Data leakage prevention method, system based on semantic feature extraction
CN109165383A (en) * 2018-08-09 2019-01-08 四川政资汇智能科技有限公司 A kind of data convergence, analysis, excavation and sharing method based on cloud platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text coherence new method using word2vec sentence vectors and most likely n-grams; Mohamad Abdolahi Kharazmi; 2017 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS); 2018-03-12; full text *
Research on Text Sentiment Classification Based on Language Model and Machine Learning; Zhang Qi; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; 2018-04-15 (No. 04); full text *

Also Published As

Publication number Publication date
CN109951464A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109951464B (en) Message sequence clustering method for unknown binary private protocol
US20230031738A1 (en) Taxpayer industry classification method based on label-noise learning
US10963685B2 (en) Generating variations of a known shred
CN107862046A (en) A kind of tax commodity code sorting technique and system based on short text similarity
CN115098620B (en) Cross-modal hash retrieval method for attention similarity migration
CN111274804A (en) Case information extraction method based on named entity recognition
CN112926045B (en) Group control equipment identification method based on logistic regression model
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN112800249A (en) Fine-grained cross-media retrieval method based on generation of countermeasure network
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN111158641A (en) Affair function point automatic identification method based on semantic analysis and text mining, corresponding storage medium and electronic device
CN110347827B (en) Event Extraction Method for Heterogeneous Text Operation and Maintenance Data
CN108519978A (en) A kind of Chinese document segmenting method based on Active Learning
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN111159377B (en) Attribute recall model training method, attribute recall model training device, electronic equipment and storage medium
CN112446205A (en) Sentence distinguishing method, device, equipment and storage medium
CN116842949A (en) Event extraction method, device, electronic equipment and storage medium
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN111881678B (en) Domain word discovery method based on unsupervised learning
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN103744830A (en) Semantic analysis based identification method of identity information in EXCEL document
CN113222059A (en) Multi-label emotion classification method using cooperative neural network chain
CN102722489B (en) The system and method for extracting object identifier from webpage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230620

Address after: Floor 9, Building 3, Yougu, No. 12, Mozhou East Road, Moling Street, Jiangning District, Nanjing, Jiangsu 211111

Patentee after: NANJING CYBER PEACE INFORMATION TECHNOLOGY CO.,LTD.

Address before: 710071 No. 2 Taibai South Road, Shaanxi, Xi'an

Patentee before: XIDIAN University

TR01 Transfer of patent right