Protocol classification method based on an improved AGNES algorithm
Technical field
The present invention relates to a protocol classification method based on an improved AGNES algorithm.
Background technology
Network information security and network confrontation have become issues of great concern in the information age. In fields such as electronic countermeasures, the bit streams of a target's communications are often obtained by special means, and the communication protocols used by the communicating parties are generally custom and non-public. In addition, protocol analysis tools used during network communication frequently encounter protocol bit streams that they cannot parse. Parsing these entirely unknown protocols is difficult, yet for fields such as network supervision, information protection, and information gathering, identifying unknown protocols is a vital task. Identifying the unknown protocols of a communication from captured bit stream sequences is therefore an important research topic.
A basic approach to unknown protocol identification today is, for a given unknown protocol, to combine data mining with pattern matching: data mining methods find the features of the unknown protocol, and pattern matching methods then match those features for identification. Such methods presuppose that single-protocol data frames are available for learning. Single-protocol data frames are obtained by clustering multi-protocol data frames, which requires a hierarchical clustering algorithm, namely the AGNES algorithm.
The idea of the traditional AGNES algorithm is: first treat each object as a cluster, then merge clusters into ever larger clusters step by step according to a set criterion, until the desired number of clusters or some other termination condition is met. The merging criterion is usually the similarity between objects or between class clusters.
The traditional AGNES algorithm is described as follows. Input: a data set containing c objects; the desired number of clusters k. Output: k class clusters. Steps: (1) treat each object as a class cluster, giving c clusters in total; (2) repeat; (3) according to the distance criterion, find the two most similar clusters; (4) merge the two most similar clusters to obtain a new set of clusters; (5) until the specified number of class clusters k is reached.
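As an illustration, the traditional AGNES procedure just described can be sketched in Python. The similarity function is a placeholder assumption (any pairwise cluster-similarity measure can be supplied), and the naive all-pairs search is written for clarity, not efficiency.

```python
def agnes(objects, k, similarity):
    """Agglomerative clustering: merge the two most similar clusters
    until only k clusters remain."""
    clusters = [[obj] for obj in objects]          # (1) each object is a cluster
    while len(clusters) > k:                       # (5) stop at k clusters
        best = None
        for a in range(len(clusters)):             # (3) find the two most
            for b in range(a + 1, len(clusters)):  #     similar clusters
                s = similarity(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # (4) merge them
        del clusters[b]
    return clusters
```

With a toy similarity (negative distance between cluster means), `agnes([1, 2, 10, 11], 2, sim)` groups the close values together.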
The AGNES algorithm is simple and accurate, but it does not scale well. The choice of merge point at each step is critical: if a poor merge is made at some step, it directly degrades the subsequent clustering.
Invention content
The object of the present invention is to overcome the deficiencies of the prior art and provide a protocol classification method based on an improved AGNES algorithm. The method can automatically determine the number of clusters, each resulting class cluster carries a similarity evaluation index, and the algorithm can inspect the current clustering result during the clustering process and extract satisfactory class clusters in time.
The object of the present invention is achieved through the following technical solution: a protocol classification method based on an improved AGNES algorithm, comprising the following steps:
S1. Input a data set DataSet of n data objects, and set the values of the minimum merge similarity lowestSimi, the minimum class cluster object number lowestSize, and the similarity reduction step temp, where the minimum merge similarity lowestSimi is less than 1.
S2. Treat each object as an initial class cluster, and set the similarity reference threshold similar = 1.
S3. Compare and cluster the i-th class cluster in the data set DataSet with all class clusters other than itself, where 1 ≤ i ≤ n and i is an integer.
S4. Change the value of i and repeat the comparison and clustering of step S3 until i has taken every integer value from 1 to n.
S5. Judge whether the object number of the class clusters obtained in S3–S4 exceeds the minimum class cluster object number lowestSize:
(1) if the object number of a class cluster exceeds lowestSize, add the clustering result to the cluster result set clusterResultSet, add the object numbers contained in the class cluster to the object number set indexResultSet, add the current value of similar to the similarity evaluation set similarSet, and go to step S6;
(2) if the object number of the class cluster does not exceed lowestSize, go to step S6.
S6. Reduce the value of similar: the updated value of similar equals the value before the update minus the similarity reduction step temp. Then judge whether the updated value of similar exceeds the minimum merge similarity lowestSimi:
(1) if the updated value of similar exceeds lowestSimi, go to step S3;
(2) if the updated value of similar does not exceed lowestSimi, go to step S7.
S7. Clustering ends; the remaining data frames that could not be merged are collected and added to leftDataSet, the set of data objects that did not form a satisfactory cluster.
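Steps S1–S7 can be condensed into a runnable Python sketch, under simplifying assumptions: the cluster-similarity function and parameter values are illustrative, clusters that exceed lowestSize are extracted as soon as they are noticed, and the bookkeeping of indexResultSet is omitted.

```python
def improved_agnes(data_set, similarity, lowest_simi, lowest_size, temp):
    """Improved AGNES: cluster at a descending similarity threshold,
    extracting sufficiently large clusters as they form."""
    clusters = [[x] for x in data_set]          # S2: singleton class clusters
    similar = 1.0                               # S2: similarity reference threshold
    cluster_results, similar_set = [], []
    while similar > lowest_simi:                # S6: lower threshold each pass
        i = 0
        while i < len(clusters):                # S3-S4: scan every class cluster
            merged = True
            while merged and len(clusters) > 1:
                # S31: class cluster j most similar to class cluster i
                j = max((b for b in range(len(clusters)) if b != i),
                        key=lambda b: similarity(clusters[i], clusters[b]))
                if similarity(clusters[i], clusters[j]) > similar:
                    clusters[i] = clusters[i] + clusters[j]   # S32: merge
                    del clusters[j]                           # S33: delete j
                    if j < i:
                        i -= 1
                else:
                    merged = False
            # S5: extract class clusters that are already large enough
            if len(clusters[i]) > lowest_size:
                cluster_results.append(clusters[i])
                similar_set.append(similar)     # similarity evaluation index
                del clusters[i]
            else:
                i += 1
        similar -= temp                         # S6: reduce similar by temp
    return cluster_results, similar_set, clusters   # S7: leftovers = leftDataSet
```

Each extracted class cluster is paired with the threshold at which it formed, which serves as its similarity evaluation index.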
The step S3 includes the following sub-steps:
S31. Compare class cluster i with every class cluster of the current data set DataSet other than itself, and find the class cluster j in the current data set DataSet with the highest similarity to class cluster i; here class cluster i is the i-th class cluster in the initial data set DataSet.
S32. Judge whether the similarity of class cluster i and class cluster j exceeds the current value of similar:
(1) if the similarity of class cluster i and class cluster j exceeds the current value of similar, merge class cluster i with class cluster j;
(2) if the similarity of class cluster i and class cluster j does not exceed the current value of similar, go to step S4.
S33. Delete the j-th class cluster from the data set DataSet, update the data set DataSet, and go to step S31.
The step S31 includes the following sub-steps:
S311. Compute the similarity of class cluster i to each class cluster in the current data set DataSet other than itself.
S312. Find the class cluster j in the current data set with the highest similarity to class cluster i.
The similarity between two class clusters is computed according to the following formula:
davg(ci, cj) = (1 / (ni × nj)) × Σ_{p ∈ ci, p′ ∈ cj} d(p, p′)
where davg(ci, cj) denotes the similarity between the two class clusters ci and cj, ni is the number of data frames contained in class cluster ci, nj is the number of data frames contained in class cluster cj, and d(p, p′) is the similarity between data frame p and data frame p′.
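As an illustration, this average over all cross-cluster frame pairs can be written directly in Python, with the frame-level similarity d passed in as a parameter (a placeholder here; the document defines two concrete choices for it):

```python
def d_avg(ci, cj, d):
    """Average-linkage similarity between class clusters ci and cj:
    the mean of d(p, q) over all cross-cluster frame pairs."""
    return sum(d(p, q) for p in ci for q in cj) / (len(ci) * len(cj))
```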
There are two methods for computing d(p, p′):
Method one: compute the similarity between the data frames directly:
d(p, p′) = sam(p, p′) / sum(p, p′),
where sam(p, p′) is obtained by the following operation: left-align data frames p and p′ and, taking a nibble as the unit, compare the aligned characters of p and p′ from left to right; the number of positions at which the aligned characters are identical is sam(p, p′), and sum(p, p′) is the number of comparisons made when computing sam(p, p′).
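One possible Python reading of method one, under the assumptions that frames are already hexadecimal strings (one character per nibble) and that comparison stops at the end of the shorter frame:

```python
def sam_sum(p, q):
    """Left-align frames p and q (hex strings, one char per nibble) and
    compare character by character from the left; return the match count
    sam and the number of comparisons sum."""
    n = min(len(p), len(q))  # comparisons stop at the shorter frame (assumption)
    sam = sum(1 for a, b in zip(p[:n], q[:n]) if a == b)
    return sam, n

def d_method_one(p, q):
    """d(p, p') = sam(p, p') / sum(p, p')."""
    sam, total = sam_sum(p, q)
    return sam / total if total else 0.0
```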
Method two: treat each data frame as a character string; the similarity similar(p, p′) between the two strings is the required d(p, p′):
d(p, p′) = similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′))
where length(p) and length(p′) are the lengths of strings p and p′ respectively, and Distance(p, p′) is the edit distance between the two strings, i.e. the number of insertion, substitution, and deletion operations required to transform string p into string p′.
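Method two can be sketched as follows, assuming the common normalized form similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′)), with Distance computed by the standard Levenshtein dynamic program:

```python
def edit_distance(p, q):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning p into q (row-by-row dynamic program)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def d_method_two(p, q):
    """similar(p, q) = 1 - Distance(p, q) / max(len(p), len(q))
    (assumed normalization)."""
    m = max(len(p), len(q))
    return 1.0 - edit_distance(p, q) / m if m else 1.0
```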
The beneficial effects of the invention are as follows. (1) The present invention can automatically determine the number of clusters by changing the termination condition, the termination condition being: under a given similarity reference threshold, no class clusters can be merged. The AGNES algorithm of the present invention first clusters at high similarity and then gradually reduces the similarity down to the set minimum similarity value.
(2) The present invention uses similarity reference thresholds of different levels so as to obtain class clusters of different clustering quality; each resulting class cluster carries a similarity evaluation index, from which it can be seen intuitively which class clusters in the clustering result are well clustered.
(3) The invention specifies a threshold on the object number of a satisfactory class cluster and inspects the clustering result during the clustering process. When a satisfactory class cluster appears (for example, when the object number of some class cluster exceeds the given threshold), that class cluster is extracted and added to the result set. On the one hand, a well-formed class cluster can be found in time and protected from being destroyed by subsequent undesirable merges; on the other hand, the influence of bad merges on later steps is reduced, improving the scalability of the algorithm.
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Specific implementation mode
The technical scheme of the present invention is described in further detail below with reference to the accompanying drawings, but the protection scope of the present invention is not limited to the following.
As shown in Fig. 1, the protocol classification method based on an improved AGNES algorithm includes the following steps:
S1. Input a data set DataSet of n data objects, and set the values of the minimum merge similarity lowestSimi, the minimum class cluster object number lowestSize, and the similarity reduction step temp, where the minimum merge similarity lowestSimi is less than 1.
S2. Treat each object as an initial class cluster, and set the similarity reference threshold similar = 1.
S3. Compare and cluster the i-th class cluster in the data set DataSet with all class clusters other than itself, where 1 ≤ i ≤ n and i is an integer.
S4. Change the value of i and repeat the comparison and clustering of step S3 until i has taken every integer value from 1 to n.
S5. Judge whether the object number of the class clusters obtained in S3–S4 exceeds the minimum class cluster object number lowestSize:
(1) if the object number of a class cluster exceeds lowestSize, add the clustering result to the cluster result set clusterResultSet, add the object numbers contained in the class cluster to the object number set indexResultSet, add the current value of similar to the similarity evaluation set similarSet, and go to step S6;
(2) if the object number of the class cluster does not exceed lowestSize, go to step S6.
S6. Reduce the value of similar: the updated value of similar equals the value before the update minus the similarity reduction step temp. Then judge whether the updated value of similar exceeds the minimum merge similarity lowestSimi:
(1) if the updated value of similar exceeds lowestSimi, go to step S3;
(2) if the updated value of similar does not exceed lowestSimi, go to step S7.
S7. Clustering ends; the remaining data frames that could not be merged are collected and added to leftDataSet, the set of data objects that did not form a satisfactory cluster.
The step S3 includes the following sub-steps:
S31. Compare class cluster i with every class cluster of the current data set DataSet other than itself, and find the class cluster j in the current data set DataSet with the highest similarity to class cluster i; here class cluster i is the i-th class cluster in the initial data set DataSet.
S32. Judge whether the similarity of class cluster i and class cluster j exceeds the current value of similar:
(1) if the similarity of class cluster i and class cluster j exceeds the current value of similar, merge class cluster i with class cluster j;
(2) if the similarity of class cluster i and class cluster j does not exceed the current value of similar, go to step S4.
S33. Delete the j-th class cluster from the data set DataSet, update the data set DataSet, and go to step S31.
The step S31 includes the following sub-steps:
S311. Compute the similarity of class cluster i to each class cluster in the current data set DataSet other than itself.
S312. Find the class cluster j in the current data set with the highest similarity to class cluster i.
The similarity between two class clusters is computed according to the following formula:
davg(ci, cj) = (1 / (ni × nj)) × Σ_{p ∈ ci, p′ ∈ cj} d(p, p′)
where davg(ci, cj) denotes the similarity between the two class clusters ci and cj, ni is the number of data frames contained in class cluster ci, nj is the number of data frames contained in class cluster cj, and d(p, p′) is the similarity between data frame p and data frame p′.
There are two methods for computing d(p, p′):
Method one: compute the similarity between the data frames directly:
d(p, p′) = sam(p, p′) / sum(p, p′),
where sam(p, p′) is obtained by the following operation: left-align data frames p and p′ and, taking a nibble as the unit, compare the aligned characters of p and p′ from left to right; the number of positions at which the aligned characters are identical is sam(p, p′), and sum(p, p′) is the number of comparisons made when computing sam(p, p′).
Method two: treat each data frame as a character string; the similarity similar(p, p′) between the two strings is the required d(p, p′):
d(p, p′) = similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′))
where length(p) and length(p′) are the lengths of strings p and p′ respectively, and Distance(p, p′) is the edit distance between the two strings, i.e. the number of insertion, substitution, and deletion operations required to transform string p into string p′.
The present invention was tested using the tcpdump experimental data set released by the Lincoln Laboratory; binary data frames of 9 of its protocols were extracted and treated as unknown protocols for the experiments. They are the dns, http, ntp, rip, smtp, ssh, arp, llc, and loop protocols.
The binary data streams were first converted to hexadecimal format. For the link-layer protocols arp, llc, and loop, the first x bytes (x = 68) of each data frame were taken; for the remaining protocols, the first x bytes (x = 68) were taken after removing the ip header and the tcp or udp header (so that the characteristics of the protocol itself are better presented); frames shorter than x bytes were taken in full. The value of x is chosen empirically: it should preferably cover all the characteristic information of a data frame, but should not be too large, or it would include a large amount of payload data, reducing the accuracy of the result and increasing the computational load.
300 frames of each of the above 9 protocols were selected at random, 2700 data frames in total, to form the input data; the frames were numbered 0 to 2699 in input order.
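As a sketch, the preprocessing just described might look like this in Python. Here header_len stands in for the length of the ip plus tcp/udp headers to strip, which varies by protocol and is an assumption for illustration.

```python
X_BYTES = 68  # x = 68, chosen empirically as described above

def preprocess(frame, header_len=0):
    """Strip an assumed header of header_len bytes, keep at most the
    first 68 bytes, and return the result in hexadecimal format."""
    payload = frame[header_len:]
    return payload[:X_BYTES].hex()  # frames shorter than x bytes kept in full
```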
Embodiment one: the parameters were set to minimum merge similarity lowestSimi = 0.1, minimum class cluster object number lowestSize = 200, and similarity reduction step temp = 0.1. The protocol classification method of the present invention based on the improved AGNES algorithm was applied, using method one in step S312; the experimental results obtained are shown in Table 1-1 below:
Table 1-1 Clustering experimental results (method one)
162 data frames remained in class clusters whose mutual similarity was below 0.1. Among these 162 remaining data frames, the class clusters that had been merged with similarity greater than 0.1 but did not meet the 200-object requirement, and whose data frame count is greater than or equal to 5, are shown in Table 1-2 below:
Table 1-2 Class clusters obtained in the experiment (method one)
Embodiment two: the parameter settings are identical to those of embodiment one, namely minimum merge similarity lowestSimi = 0.1, minimum class cluster object number lowestSize = 200, and similarity reduction step temp = 0.1, but method two in step S312 was used; the experimental results obtained are shown in Table 2-1 below:
Table 2-1 Clustering experimental results (method two)
58 data frames remained in class clusters whose mutual similarity was below 0.1. Among these 58 remaining data frames, the class clusters that had been merged with similarity greater than 0.1 but did not meet the 200-object requirement, and whose data frame count is greater than or equal to 5, are shown in Table 2-2 below:
Table 2-2 Class clusters obtained in the experiment (method two)
Comparing embodiment one and embodiment two: the results of embodiment one show that, of 9 input classes, 8 were successfully clustered; the remaining 162 data frames either had similarity below 0.1 or did not reach the 200-object requirement. The overall clustering accuracy is (2700 − 314 − 162)/(2700 − 162) = 85.88%, a good clustering result. The results of embodiment two show an overall clustering accuracy of (2700 − 502 − 58)/(2700 − 58) = 81.00%, slightly below that of embodiment one; however, the similarity computation of method two can cluster more data frames, although it is more prone to erroneous clusters when handling data frames of relatively low similarity (for example, 0.2 to 0.3).
Embodiment three: the 9 protocols above were clustered using common clustering algorithms in the weka tool. StringToWordVector was first applied for data preprocessing, the cluster number parameter of each clustering algorithm was then set to 9, and classes-to-clusters evaluation was used for assessment. Each clustering algorithm was run 3 times with different random seeds and the average of the results was taken; the results obtained are shown in Tables 3-1, 3-2, and 3-3 below:
Table 3-1 Results of the SimpleKMeans clustering algorithm in weka
Table 3-2 Results of the sIB clustering algorithm in weka
Table 3-3 Results of the EM clustering algorithm in weka
Comparing embodiment one and embodiment three, it can be seen that, using the protocol classification method of the present invention based on the improved AGNES algorithm with method one in step S312, the overall accuracy is 20.43 percentage points higher than that of the SimpleKMeans algorithm, 11.48 percentage points higher than that of the sIB algorithm, and 13.08 percentage points higher than that of the EM algorithm.