CN109951464A - Message sequence clustering method for an unknown binary private protocol
Abstract
The invention discloses a message sequence clustering method for unknown binary private protocols, which mainly solves the problem that the prior art cannot accurately measure the similarity between protocol message sequences during protocol reverse engineering. The implementation is as follows: 1) collect message sequences of an unknown binary private protocol; 2) preprocess the collected message sequences; 3) extract multi-scale N-gram features of the preprocessed message sequences; 4) reduce the dimension of the multi-scale N-gram features based on variance selection; 5) compute an embedded representation of the message sequences from the dimension-reduced multi-scale N-gram features; 6) determine the optimal cluster number K from the embedded representation; 7) cluster the message sequences according to the optimal cluster number K. The invention fully mines the latent semantic information of message sequences, can accurately measure the similarity between message sequences, improves clustering accuracy, and can be used for clustering unknown binary private protocols.
Description
Technical Field
The invention belongs to the field of information technology, and further relates to a message sequence clustering method that can be used for clustering unknown binary private protocols.
Background
Network protocols are specifications for communication between entities in a network; they define the data formats and the associated synchronization rules used when communicating entities exchange information. In addition to standardized communication protocols, many unknown proprietary protocols exist in networks. Message sequence clustering is the first step of protocol reverse engineering: it separates the messages of the various private-protocol message sequences as cleanly as possible according to the similarity between message sequences, after which field format inference and state machine inference are performed.
The core problem of private protocol message sequence clustering, i.e. network protocol identification, is how to accurately measure the similarity between message sequences. Current message sequence clustering algorithms for unknown private protocols can be roughly divided into three categories: edit-distance-based, keyword-based, and probability-model-based. The edit distance measures the similarity between sequences by the minimum number of operations (inserting, deleting, or replacing a character) needed to change one string into another. The edit distance algorithm and the Needleman-Wunsch algorithm share the idea of finding the longest common subsequence: both work from a text-matching perspective and ignore the local features between sequences, yet these local features may be the key to measuring similarity within a protocol cluster, namely the protocol keywords. Probability-model-based sequence clustering is difficult to model and is only effective for clustering long sequences. The representative keyword-based sequence clustering algorithm is Apriori, which suffers from a large number of frequently overlapping items, making the dimensionality of the feature vectors representing message sequences very large. In 2013, Wang Yipeng et al. pioneered the introduction of the N-gram model and latent Dirichlet allocation (LDA) from natural language processing into protocol sequence clustering, determining the optimal value of N with Zipf's law and then modeling with LDA. That method ignores the varying lengths of protocol message keywords and does not consider the semantic association between neighboring words when computing the message embedding representation, so it cannot accurately measure the similarity between message sequences and its clustering effect is poor.
Disclosure of Invention
The invention aims to provide a message sequence clustering method for unknown binary private protocols that addresses the defects of the prior art, so that the latent semantic information in messages is fully mined during private protocol feature extraction and the clustering accuracy is improved.
The technical scheme of the invention is as follows: model the message sequences with an N-gram language model, extract multi-scale N-gram features without fixing the N value, and train a word-vector embedding representation of the message sequences with a word2vec model. The implementation comprises the following steps:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping the link layer and transport layer data of the unknown binary private protocol message sequence with network packet analysis technology to obtain the application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum and maximum of the N value range;
(3b) for each N value in the range, segmenting the sample data set with the N-gram model to obtain the word vectors of the segmented message sequences, which serve as the multi-scale N-gram features of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) One-Hot encoding each message sequence according to its word vector to obtain the encoded feature vector space model;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) computing the embedded representation of the message sequences according to the feature vector vocabulary:
(5a) screening the word vectors of each message sequence with the feature vector vocabulary, keeping only the words contained in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the weight matrix of the shallow neural network hidden layer, which serves as the embedded vector dictionary wv of the vocabulary;
(5c) for each word w in each message sequence, looking up its embedded vector representation wv[w] in the embedded vector dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix X_Ev of the message sequences;
(6) performing a mode-point search on the embedded vector matrix X_Ev of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing separately the message sequence data divided into each set.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention extracts multi-scale N-gram features without fixing the N value, which overcomes the problem of building embedding representations for protocol message keywords of unequal lengths.
Secondly, the vocabulary is given an embedded representation with a word2vec model, which incorporates the semantic association between neighboring words when determining keyword weights; the method can therefore fully mine the latent semantic information in messages and further improve the clustering accuracy.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The specific steps of the present invention are further described below with reference to fig. 1.
Step 1, collecting an unknown binary private protocol message sequence by using a data collection method.
(1a) Setting the network card of the server acquisition equipment to promiscuous mode so that it can monitor the wireless communication data, then starting communication entity A and communication entity B to establish a communication connection;
(1b) intercepting the message sequence communication data between communication entities A and B with Wireshark and saving it as a pcap file, yielding an unknown binary private protocol message sequence that contains link layer data, transport layer data, and application layer data.
Step 2, preprocessing the acquired unknown binary private protocol message sequence.
(2a) The intercepted unknown binary private protocol message sequence is analyzed according to the structure of a network data packet, i.e. the link layer data and transport layer data contained in the message sequence are stripped to obtain the application layer data of the message sequence, which is in binary format;
(2b) according to the base conversion rule (for example, binary 1111 corresponds to hexadecimal F), the application layer binary message sequence data is converted into hexadecimal message sequence data;
(2c) the hexadecimal message sequence data is labeled to generate the sample data set.
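By way of illustration, steps 1-2 can be sketched in Python; the scapy library, the capture file name "capture.pcap", and the helper name load_payloads are assumptions of this example, not part of the claimed method:

```python
# Illustrative sketch of steps 1-2 (assumes a Wireshark capture saved as
# "capture.pcap" and the scapy library for packet parsing).
from scapy.all import rdpcap, Raw

def load_payloads(pcap_path):
    """Strip link/transport layers; keep application payloads as hex strings."""
    payloads = []
    for pkt in rdpcap(pcap_path):
        if Raw in pkt:                            # Raw layer = data above transport
            payloads.append(pkt[Raw].load.hex())  # binary payload -> hex string
    return payloads

messages = load_payloads("capture.pcap")          # e.g. ["a018...", "a010...", ...]
```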
Step 3, extracting the multi-scale N-gram features of the sample data set.
The N-gram model is a natural language processing model based on string statistics. It rests on the assumption that the occurrence of the n-th word is related only to the preceding n-1 words and to no other words, so the probability of occurrence of an entire sequence equals the product of the conditional probabilities of its words. Assuming that a sequence T is formed by the words ω1, ω2, ..., ωn, the probability of occurrence of the sequence T is:

P(T) = p(ω1ω2...ωn) = p(ω1)×p(ω2|ω1)×...×p(ωn|ω1ω2...ωn-1),

where P(T) is the probability of occurrence of the sequence T and p(ωi|ω1...ωi-1) is the conditional probability of occurrence of the word ωi given its predecessors.
The selection of the N value in the N-gram model is very important: a larger N better preserves the integrity of the segmented data but reduces effectiveness, while a too-small N cannot capture complete lexical information during word segmentation. The fixed N value is therefore extended to a range over which N can take multiple values.
The multi-scale N-gram features of the sample data set are extracted as follows:
(3a) The minimum and maximum of the N value range are determined, generally set to 2-5.
(3b) For each N value in the range, the sample data set is segmented with the N-gram model to obtain the segmented word vectors of the message sequences; for example, when N is 2, the message sequence "020a" is segmented into the word vector "02 20 0a";
(3c) the word vectors obtained under the different N values are combined into the segmented word vector of each message sequence, which serves as the multi-scale N-gram features of the sample data set.
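By way of illustration, the multi-scale segmentation can be sketched in Python; the helper name multiscale_ngrams is an assumption of this example:

```python
# Illustrative sketch of step 3: overlapping N-grams for every N in [n_min, n_max].
def multiscale_ngrams(message, n_min=2, n_max=5):
    """Return the multi-scale N-gram word vector of a hexadecimal message string."""
    words = []
    for n in range(n_min, n_max + 1):
        words.extend(message[i:i + n] for i in range(len(message) - n + 1))
    return words

print(multiscale_ngrams("020a", n_min=2, n_max=2))   # ['02', '20', '0a']
```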
Step 4, reducing the dimension of the multi-scale N-gram features based on variance selection.
Feature selection methods can be divided into three types according to how features are selected: (1) filter methods, which score each feature by its divergence or correlation and select features against a score threshold or a target number of features; (2) wrapper methods, which select or exclude several features at a time according to an objective function; (3) embedded methods, which train a machine learning model to obtain a weight coefficient for every feature and select features in descending order of these coefficients.
The invention uses, but is not limited to, variance selection from the filter methods to reduce the feature dimension. The specific implementation is as follows:
(4a) The message sequence word vectors are encoded according to the One-Hot encoding rule. One-Hot encoding uses an N-bit state register to encode N states; each state has its own independent register bit, and only one bit is valid at any time. For example, with the feature vector vocabulary "a0 b0 c0 00 05 …", the feature "a0" may be encoded as [1 0 0 0 0 …] and the feature "b0" as [0 1 0 0 0 …], so the embedded vector of the message sequence "a000055ee00cc00445a0000505e5e0e0c0c0c04045" is denoted [1 0 0 1 1 …];
(4b) the embedded vectors of the encoded message sequences are combined to obtain the feature vector space model;
(4c) the variance of each feature vector is calculated from the feature vector space model:

σ² = (1/N) Σ (x − u)²,

where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set;
(4d) a score threshold for the variance selection is set, and the feature vectors whose variance exceeds the threshold are selected as the feature vector vocabulary of the sample data set.
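By way of illustration, step 4 can be sketched with scikit-learn's VarianceThreshold as one concrete realization; the helper name variance_select and the threshold expression 0.8*(1-0.8) (the usual scikit-learn idiom for Boolean features) are assumptions of this example rather than values fixed by the method:

```python
# Illustrative sketch of step 4: presence-based One-Hot encoding, then
# variance-based feature selection (assumes scikit-learn and numpy).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def variance_select(messages_words, threshold=0.8 * (1 - 0.8)):
    """Keep the N-gram words whose presence/absence variance exceeds the threshold."""
    vocab = sorted({w for words in messages_words for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(messages_words), len(vocab)))
    for row, words in enumerate(messages_words):
        for w in words:
            X[row, index[w]] = 1.0               # 1 if the word occurs in this message
    keep = VarianceThreshold(threshold).fit(X).get_support()
    return [w for w, kept in zip(vocab, keep) if kept]
```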
Step 5, training the word vector features of the message sequences with the word2vec model according to the feature vector vocabulary to obtain the word embedded vector dictionary wv.
The word2vec model is a shallow neural network model used to generate word vectors: by maximizing the likelihood of predicting the words at adjacent positions of an input word, each word is mapped to an embedded vector that represents the semantic relations between word pairs; this vector comes from the hidden layer of the shallow neural network.
The specific implementation of this step is as follows:
(5a) The word vectors of each message sequence are screened with the feature vector vocabulary, keeping only the words contained in the vocabulary as the word vector features of the message sequence. For example, with the feature vector vocabulary "a0 b0 c0 00 05 …", the message sequence word vector "a000055ee00cc00445a0000505e5e0e0c0c0c04045" is screened to "a0 c0 a0 00 05";
(5b) the word vector features of the sample training set are taken as input and trained with the word2vec model to obtain the weight matrix of the shallow neural network hidden layer, which is taken as the embedded vector dictionary wv of the words.
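By way of illustration, (5b) can be sketched with the gensim library; the library choice, the skip-gram variant, and the embedding dimension below are assumptions of this example, not prescribed by the method:

```python
# Illustrative sketch of step 5(b): each message's screened word list is one
# "sentence"; the trained hidden-layer weights are exposed through model.wv.
from gensim.models import Word2Vec

def train_embeddings(filtered_messages, dim=64):
    """Train word2vec on the screened word vectors of the message sequences."""
    model = Word2Vec(sentences=filtered_messages, vector_size=dim,
                     window=5, min_count=1, sg=1)  # sg=1: skip-gram (a choice)
    return model.wv                                # embedded vector dictionary wv

wv = train_embeddings([["a0", "c0", "a0", "00", "05"]])  # toy data
print(wv["a0"])                                    # embedded vector wv[w]
```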
Step 6, computing the embedded representation of the message sequences.
(6a) For each word w in a message sequence, its embedded vector representation wv[w] is looked up in the embedded vector dictionary wv; the vectors are summed and averaged to obtain the embedded vector representation E_v of each message sequence:

E_v = (1/M) Σ_w wv[w],

where E_v is the embedded vector representation of the message sequence, wv[w] is the embedded vector corresponding to each word, and M is the number of words contained in the message sequence;
(6b) the embedded vector E_v of each message sequence is normalized into a unit vector, and the normalized vectors are combined to obtain the embedded vector matrix X_Ev of the message sequences.
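By way of illustration, step 6 can be sketched in Python; the helper name embed_messages is an assumption of this example:

```python
# Illustrative sketch of step 6: average each message's word embeddings and
# normalize to unit length, stacking the results into the matrix X_Ev.
import numpy as np

def embed_messages(messages_words, wv):
    rows = []
    for words in messages_words:
        E_v = np.mean([wv[w] for w in words], axis=0)  # (1/M) * sum of wv[w]
        rows.append(E_v / np.linalg.norm(E_v))         # unit-vector normalization
    return np.vstack(rows)                             # embedded vector matrix X_Ev
```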
Step 7, performing a mode-point search on the embedded vector matrix X_Ev of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K.
(7a) A point of the embedded vector matrix X_Ev is selected at random as the starting point s;
(7b) a search radius h is set, the offsets required to move the point s to each point x_i within the radius h are computed, and their average is taken as the mean offset;
(7c) the point s is moved along the direction of the mean offset to a new point s', the length of the move being the modulus of the mean offset;
(7d) taking the moved new point s' as the new starting point, (7b)-(7c) are repeated, iterating until the mean offset is smaller than a set threshold or the iteration limit is reached; the new point s' thus obtained is a cluster center, and the number of such cluster centers is the optimal cluster number K.
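By way of illustration, scikit-learn's MeanShift can serve as one concrete realization of this mode-point search; its bandwidth parameter plays the role of the search radius h, and the default value below is illustrative:

```python
# Illustrative sketch of step 7: the number of MeanShift modes is taken as
# the optimal cluster number K (assumes scikit-learn).
from sklearn.cluster import MeanShift

def optimal_k(X_Ev, h=0.5):                    # h: illustrative search radius
    ms = MeanShift(bandwidth=h).fit(X_Ev)
    return len(ms.cluster_centers_)            # number of modes = cluster number K
```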
Step 8, clustering the message sequences with the K-Means clustering method.
In K-Means clustering, the sample data is divided into K clusters so that each point belongs to the class of its nearest mean, i.e. its nearest cluster center. The specific implementation is as follows:
(8a) K points of the embedded vector matrix X_Ev are selected at random as the initial cluster centers {u_1, u_2, ..., u_K};
(8b) the Euclidean distance d between every point x_i of the embedded vector matrix X_Ev and each cluster center u_j is computed, x_i is assigned the class label λ_i corresponding to the minimum Euclidean distance d, and the clusters are updated accordingly;
(8c) the cluster centers u'_i are updated by taking the mean of all points in each cluster as the new cluster center of that cluster;
(8d) with u'_i as the new cluster centers, (8b)-(8c) are repeated, iterating until the cluster centers no longer change or the iteration limit is reached;
(8e) the message sequence data assigned to each cluster is stored separately.
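By way of illustration, step 8 can be sketched with scikit-learn's KMeans; the helper name cluster_messages is an assumption of this example:

```python
# Illustrative sketch of step 8: K-Means with the K found by MeanShift,
# grouping the raw message sequences by their assigned cluster.
from collections import defaultdict
from sklearn.cluster import KMeans

def cluster_messages(X_Ev, messages, K):
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X_Ev)
    clusters = defaultdict(list)
    for msg, label in zip(messages, labels):
        clusters[label].append(msg)            # one set of message sequences per cluster
    return clusters
```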
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation experiment conditions are as follows:
In the simulation experiment, the two communication entities are a Dobby pocket unmanned aerial vehicle and its ground station, produced by ZeroTech; the communication data intercepted between them is used as the unknown binary private protocol message sequence data of the simulation experiment.
The protocol message sequences were clustered and analyzed on a server, namely a computer with an Intel(R) Core(TM) i5-8250U CPU, 16 GB of memory, and the Windows 10 operating system, and the simulation results were evaluated.
2. Simulation experiment content and result analysis:
In the simulation experiment, protocol message sequence data of both communication entities is collected, and feature extraction and cluster analysis are performed on the server. The specific steps are as follows:
step 1, collecting protocol message sequence data of both communication entities, selecting 2000 protocol message sequences for data annotation;
step 2, preprocessing the protocol message sequence data to generate application layer hexadecimal message sequence data;
step 3, setting the maximum value N_max of the N value range;
Step 4, segmenting and combining the message sequence according to the set N value threshold range to obtain a word vector of the message sequence;
step 5, one-hot encoding the message sequence word vectors, calculating the variance distribution of the feature vectors, and selecting the features whose variance exceeds the threshold as the feature vector vocabulary; the threshold parameter is set to 0.8 in the simulation experiment;
step 6, screening word vectors of the message sequence according to the vocabulary table, taking the word vectors as input, and training by using a word2vec model to obtain an embedded vector dictionary wv of the vocabulary;
step 7, for each word w in each message sequence, looking up its embedded vector representation wv[w] in the embedded vector dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence;
step 8, normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix X_Ev of the message sequences;
Step 9, determining the optimal clustering number K of the message sequence by using MeanShift modular point search;
and step 10, clustering the message sequence by using a K-Means clustering method according to the optimal clustering number K to obtain the class division of the message sequence, wherein the result is shown in a table 1 and is stored.
Table 1: summary of simulation test results
Number of messages | Ratio of | Representation format | |
Class 1 | 1484 | 74.20% | a018* |
Class 2 | 295 | 14.75% | a010* |
Class 3 | 147 | 7.35% | a00f* |
Class 4 | 74 | 3.70% | a005* |
As can be seen from Table 1, the simulation experiment divides the 2000 collected protocol message sequences into 4 classes, whose representative formats are "a018*", "a010*", "a00f*", and "a005*" respectively, consistent with the manual labels.
3. Accuracy analysis of simulation experiment:
To demonstrate the effectiveness of the cluster analysis of the present invention, the clustering accuracy (the proportion of message sequences assigned to their correct class) was calculated under different N-value range thresholds N_max; the results are shown in Table 2.
Table 2: accuracy list of message sequence clustering analysis
Nmax | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Accuracy of | 0.5383 | 0.7189 | 0.7247 | 0.9940 | 0.5301 | 0.9658 | 0.6019 | 0.6450 |
As can be seen from Table 2, the accuracy is highest when N_max is 5, reaching 99.40%. This result verifies the effectiveness of the invention and shows that the method is an effective clustering method for unknown binary private protocol message sequences.
Claims (7)
1. A message sequence clustering method of an unknown binary private protocol is characterized by comprising the following steps:
(1) collecting an unknown binary private protocol message sequence by using a data collection method;
(2) preprocessing an acquired unknown binary private protocol message sequence:
(2a) stripping the link layer and transport layer data of the unknown binary private protocol message sequence with network packet analysis technology to obtain the application layer binary private protocol message sequence data;
(2b) converting the binary message sequence data of the application layer into hexadecimal message sequence data according to a binary conversion rule;
(2c) marking the hexadecimal message sequence data to generate a sample data set;
(3) extracting multi-scale N-gram characteristics of the sample data set:
(3a) determining the minimum and maximum of the N value range;
(3b) for each N value in the range, segmenting the sample data set with the N-gram model to obtain the segmented message sequence word vectors, which serve as the multi-scale N-gram features of the sample data set;
(4) reducing the dimension of the multi-scale N-gram features based on variance selection:
(4a) One-Hot encoding each message sequence according to its word vector to obtain the encoded feature vector space model;
(4b) calculating the variance distribution of each feature vector according to the feature vector space model;
(4c) reducing the dimension of the extracted multi-scale N-gram features according to the variance distribution of each feature vector, i.e. selecting the feature vectors with larger variance as the feature vector vocabulary of the sample data set;
(5) computing the embedded representation of the message sequences according to the feature vector vocabulary:
(5a) screening the word vectors of each message sequence with the feature vector vocabulary, keeping only the words contained in the vocabulary as the word vector features of the message sequence;
(5b) taking the word vector features of the sample training set as input and training with the word2vec model to obtain the weight matrix of the shallow neural network hidden layer, which serves as the embedded vector dictionary wv of the vocabulary;
(5c) for each word w in each message sequence, looking up its embedded vector representation wv[w] in the embedded vector dictionary wv, then summing and averaging to obtain the embedded vector representation E_v of each message sequence;
(5d) normalizing the embedded vector E_v of each message sequence into a unit vector to obtain the embedded vector matrix X_Ev of the message sequences;
(6) performing a mode-point search on the embedded vector matrix X_Ev of the message sequences with the MeanShift probability density estimation method to obtain the optimal cluster number K of the message sequences;
(7) clustering message sequences:
(7a) taking an embedded vector matrix of the message sequence as input, and dividing the message sequence into K sets by using a K-Means clustering method;
(7b) storing separately the message sequence data divided into each set.
2. The method according to claim 1, wherein the binary private protocol packet sequence in (1) comprises link layer data, transport layer data, and application layer data.
3. The method of claim 1, wherein the One-Hot encoding in (4a) is performed using an N-bit status register to encode N states, each state having its own independent register bit and only One of which is active at any time.
4. The method of claim 1, wherein the variance distribution of each feature vector in (4b) is calculated by the following formula:

σ² = (1/N) Σ (x − u)²,

where σ² is the variance of each feature vector, x is the value of the feature vector in each sample of the data set, u is its mean, and N is the total number of samples in the data set.
5. The method of claim 1, wherein the word2vec model in (5b) is a shallow neural network model for generating word vectors: by maximizing the likelihood of predicting the words at adjacent positions of an input word, each word is mapped to an embedded vector that represents the semantic relations between word pairs, and this vector comes from the hidden layer of the shallow neural network.
6. The method of claim 1, wherein the mode-point search on the embedded vector matrix X_Ev of the message sequences in (6) is performed with the MeanShift probability density estimation method as follows:
(6a) a point of the embedded vector matrix X_Ev is selected at random as the starting point s;
(6b) a search radius h is set, the offsets required to move the point s to each point x_i within the radius h are computed, and their average is taken as the mean offset;
(6c) the point s is moved along the direction of the mean offset to a new point s', the length of the move being the modulus of the mean offset;
(6d) taking the moved new point s' as the new starting point, (6b)-(6c) are repeated, iterating until the mean offset is smaller than a set threshold or the iteration limit is reached; the new point s' thus obtained is a cluster center, and the number of such cluster centers is the optimal cluster number K.
7. The method according to claim 1, wherein in (7a) the K-Means clustering method divides the message sequences into K sets as follows:
(7a) K points of the embedded vector matrix X_Ev are selected at random as the initial cluster centers {u_1, u_2, ..., u_K};
(7b) the Euclidean distance d between every point x_i of the embedded vector matrix X_Ev and each cluster center u_j is computed, x_i is assigned the class label λ_i corresponding to the minimum Euclidean distance d, and the clusters are updated accordingly;
(7c) the cluster centers u'_i are updated by taking the mean of all points in each cluster as the new cluster center of that cluster;
(7d) with u'_i as the new cluster centers, (7b)-(7c) are repeated, iterating until the cluster centers no longer change or the iteration limit is reached.