Protocol classification method based on an improved AGNES algorithm
Technical field
The present invention relates to a protocol classification method based on an improved AGNES algorithm.
Background technology
Network information security and network confrontation have become issues of great concern in the information age. In fields such as electronic countermeasures, the bit streams of a target's communications are often obtained by special means, and the communication protocols used by the communicating parties are generally custom and non-public. In addition, protocol analysis tools used during network communication frequently encounter protocol bit streams that they cannot parse. Parsing these entirely unknown protocols is difficult, yet for fields such as network supervision, information protection, and information gathering, identifying unknown protocols is a vital task. Identifying the unknown protocols of a communication from captured bit stream sequences is therefore an important research topic.
A basic approach to unknown protocol identification today is, for a given unknown protocol, to combine data mining with pattern matching: data mining methods find the features of the unknown protocol, and pattern matching methods then match those features for identification. Such methods presuppose that single-protocol data frames are available for learning. Single-protocol data frames are obtained by clustering multi-protocol data frames, which requires a hierarchical clustering algorithm, namely the AGNES algorithm.
The idea of the traditional AGNES algorithm is: first treat each object as a cluster, then merge clusters into ever larger clusters step by step according to a set criterion, until the desired number of clusters or some other termination condition is met. The merging criterion is usually the similarity between objects or between class clusters.
The traditional AGNES algorithm is described as follows. Input: a data set containing c objects; the desired number of clusters k. Output: k class clusters. Steps: (1) treat each object as a class cluster, giving c clusters in total; (2) repeat; (3) according to the distance criterion, find the two most similar clusters; (4) merge the two most similar clusters to obtain a new set of clusters; (5) until the specified number of class clusters k is reached.
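As an illustration, the traditional AGNES procedure just described can be sketched in Python. The similarity function is a placeholder assumption (any pairwise cluster-similarity measure can be supplied), and the naive all-pairs search is written for clarity, not efficiency.

```python
def agnes(objects, k, similarity):
    """Agglomerative clustering: merge the two most similar clusters
    until only k clusters remain."""
    clusters = [[obj] for obj in objects]          # (1) each object is a cluster
    while len(clusters) > k:                       # (5) stop at k clusters
        best = None
        for a in range(len(clusters)):             # (3) find the two most
            for b in range(a + 1, len(clusters)):  #     similar clusters
                s = similarity(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # (4) merge them
        del clusters[b]
    return clusters
```

With a toy similarity (negative distance between cluster means), `agnes([1, 2, 10, 11], 2, sim)` groups the close values together.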
The AGNES algorithm is simple and accurate, but it does not scale well. The choice of merge point at each step is critical: if a poor merge is made at some step, it directly degrades the subsequent clustering.
Invention content
The object of the present invention is to overcome the deficiencies of the prior art and provide a protocol classification method based on an improved AGNES algorithm. The method can automatically determine the number of clusters, each resulting class cluster carries a similarity evaluation index, and the algorithm can inspect the current clustering result during the clustering process and extract satisfactory class clusters in time.
The object of the present invention is achieved through the following technical solution: a protocol classification method based on an improved AGNES algorithm, comprising the following steps:
S1. Input a data set DataSet of n data objects, and set the values of the minimum merge similarity lowestSimi, the minimum class cluster object number lowestSize, and the similarity reduction step temp, where the minimum merge similarity lowestSimi is less than 1.
S2. Treat each object as an initial class cluster, and set the similarity reference threshold similar = 1.
S3. Compare and cluster the i-th class cluster in the data set DataSet with all class clusters other than itself, where 1 ≤ i ≤ n and i is an integer.
S4. Change the value of i and repeat the comparison and clustering of step S3 until i has taken every integer value from 1 to n.
S5. Judge whether the object number of the class clusters obtained in S3–S4 exceeds the minimum class cluster object number lowestSize:
(1) if the object number of a class cluster exceeds lowestSize, add the clustering result to the cluster result set clusterResultSet, add the object numbers contained in the class cluster to the object number set indexResultSet, add the current value of similar to the similarity evaluation set similarSet, and go to step S6;
(2) if the object number of the class cluster does not exceed lowestSize, go to step S6.
S6. Reduce the value of similar: the updated value of similar equals the value before the update minus the similarity reduction step temp. Then judge whether the updated value of similar exceeds the minimum merge similarity lowestSimi:
(1) if the updated value of similar exceeds lowestSimi, go to step S3;
(2) if the updated value of similar does not exceed lowestSimi, go to step S7.
S7. Clustering ends; the remaining data frames that could not be merged are collected and added to leftDataSet, the set of data objects that did not form a satisfactory cluster.
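Steps S1–S7 can be condensed into a runnable Python sketch, under simplifying assumptions: the cluster-similarity function and parameter values are illustrative, clusters that exceed lowestSize are extracted as soon as they are noticed, and the bookkeeping of indexResultSet is omitted.

```python
def improved_agnes(data_set, similarity, lowest_simi, lowest_size, temp):
    """Improved AGNES: cluster at a descending similarity threshold,
    extracting sufficiently large clusters as they form."""
    clusters = [[x] for x in data_set]          # S2: singleton class clusters
    similar = 1.0                               # S2: similarity reference threshold
    cluster_results, similar_set = [], []
    while similar > lowest_simi:                # S6: lower threshold each pass
        i = 0
        while i < len(clusters):                # S3-S4: scan every class cluster
            merged = True
            while merged and len(clusters) > 1:
                # S31: class cluster j most similar to class cluster i
                j = max((b for b in range(len(clusters)) if b != i),
                        key=lambda b: similarity(clusters[i], clusters[b]))
                if similarity(clusters[i], clusters[j]) > similar:
                    clusters[i] = clusters[i] + clusters[j]   # S32: merge
                    del clusters[j]                           # S33: delete j
                    if j < i:
                        i -= 1
                else:
                    merged = False
            # S5: extract class clusters that are already large enough
            if len(clusters[i]) > lowest_size:
                cluster_results.append(clusters[i])
                similar_set.append(similar)     # similarity evaluation index
                del clusters[i]
            else:
                i += 1
        similar -= temp                         # S6: reduce similar by temp
    return cluster_results, similar_set, clusters   # S7: leftovers = leftDataSet
```

Each extracted class cluster is paired with the threshold at which it formed, which serves as its similarity evaluation index.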
The step S3 includes the following sub-steps:
S31. Compare class cluster i with every class cluster of the current data set DataSet other than itself, and find the class cluster j in the current data set DataSet with the highest similarity to class cluster i; here class cluster i is the i-th class cluster in the initial data set DataSet.
S32. Judge whether the similarity of class cluster i and class cluster j exceeds the current value of similar:
(1) if the similarity of class cluster i and class cluster j exceeds the current value of similar, merge class cluster i with class cluster j;
(2) if the similarity of class cluster i and class cluster j does not exceed the current value of similar, go to step S4.
S33. Delete the j-th class cluster from the data set DataSet, update the data set DataSet, and go to step S31.
The step S31 includes the following sub-steps:
S311. Compute the similarity of class cluster i to each class cluster in the current data set DataSet other than itself.
S312. Find the class cluster j in the current data set with the highest similarity to class cluster i.
The similarity between two class clusters is computed according to the following formula:
davg(ci, cj) = (1 / (ni × nj)) × Σ_{p ∈ ci, p′ ∈ cj} d(p, p′)
where davg(ci, cj) denotes the similarity between the two class clusters ci and cj, ni is the number of data frames contained in class cluster ci, nj is the number of data frames contained in class cluster cj, and d(p, p′) is the similarity between data frame p and data frame p′.
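As an illustration, this average over all cross-cluster frame pairs can be written directly in Python, with the frame-level similarity d passed in as a parameter (a placeholder here; the document defines two concrete choices for it):

```python
def d_avg(ci, cj, d):
    """Average-linkage similarity between class clusters ci and cj:
    the mean of d(p, q) over all cross-cluster frame pairs."""
    return sum(d(p, q) for p in ci for q in cj) / (len(ci) * len(cj))
```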
There are two methods for computing d(p, p′):
Method one: compute the similarity between the data frames directly:
d(p, p′) = sam(p, p′) / sum(p, p′),
where sam(p, p′) is obtained by the following operation: left-align data frames p and p′ and, taking a nibble as the unit, compare the aligned characters of p and p′ from left to right; the number of positions at which the aligned characters are identical is sam(p, p′), and sum(p, p′) is the number of comparisons made when computing sam(p, p′).
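One possible Python reading of method one, under the assumptions that frames are already hexadecimal strings (one character per nibble) and that comparison stops at the end of the shorter frame:

```python
def sam_sum(p, q):
    """Left-align frames p and q (hex strings, one char per nibble) and
    compare character by character from the left; return the match count
    sam and the number of comparisons sum."""
    n = min(len(p), len(q))  # comparisons stop at the shorter frame (assumption)
    sam = sum(1 for a, b in zip(p[:n], q[:n]) if a == b)
    return sam, n

def d_method_one(p, q):
    """d(p, p') = sam(p, p') / sum(p, p')."""
    sam, total = sam_sum(p, q)
    return sam / total if total else 0.0
```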
Method two: treat each data frame as a character string; the similarity similar(p, p′) between the two strings is the required d(p, p′):
d(p, p′) = similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′))
where length(p) and length(p′) are the lengths of strings p and p′ respectively, and Distance(p, p′) is the edit distance between the two strings, i.e. the number of insertion, substitution, and deletion operations required to transform string p into string p′.
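Method two can be sketched as follows, assuming the common normalized form similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′)), with Distance computed by the standard Levenshtein dynamic program:

```python
def edit_distance(p, q):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions turning p into q (row-by-row dynamic program)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def d_method_two(p, q):
    """similar(p, q) = 1 - Distance(p, q) / max(len(p), len(q))
    (assumed normalization)."""
    m = max(len(p), len(q))
    return 1.0 - edit_distance(p, q) / m if m else 1.0
```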
The beneficial effects of the invention are as follows. (1) The present invention can automatically determine the number of clusters by changing the termination condition, the termination condition being: under a given similarity reference threshold, no class clusters can be merged. The AGNES algorithm of the present invention first clusters at high similarity and then gradually reduces the similarity down to the set minimum similarity value.
(2) The present invention uses similarity reference thresholds of different levels so as to obtain class clusters of different clustering quality; each resulting class cluster carries a similarity evaluation index, from which it can be seen intuitively which class clusters in the clustering result are well clustered.
(3) The invention specifies a threshold on the object number of a satisfactory class cluster and inspects the clustering result during the clustering process. When a satisfactory class cluster appears (for example, when the object number of some class cluster exceeds the given threshold), that class cluster is extracted and added to the result set. On the one hand, a well-formed class cluster can be found in time and protected from being destroyed by subsequent undesirable merges; on the other hand, the influence of bad merges on later steps is reduced, improving the scalability of the algorithm.
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Specific implementation mode
The technical scheme of the present invention is described in further detail below with reference to the accompanying drawings, but the protection scope of the present invention is not limited to the following.
As shown in Fig. 1, the protocol classification method based on an improved AGNES algorithm includes the following steps:
S1. Input a data set DataSet of n data objects, and set the values of the minimum merge similarity lowestSimi, the minimum class cluster object number lowestSize, and the similarity reduction step temp, where the minimum merge similarity lowestSimi is less than 1.
S2. Treat each object as an initial class cluster, and set the similarity reference threshold similar = 1.
S3. Compare and cluster the i-th class cluster in the data set DataSet with all class clusters other than itself, where 1 ≤ i ≤ n and i is an integer.
S4. Change the value of i and repeat the comparison and clustering of step S3 until i has taken every integer value from 1 to n.
S5. Judge whether the object number of the class clusters obtained in S3–S4 exceeds the minimum class cluster object number lowestSize:
(1) if the object number of a class cluster exceeds lowestSize, add the clustering result to the cluster result set clusterResultSet, add the object numbers contained in the class cluster to the object number set indexResultSet, add the current value of similar to the similarity evaluation set similarSet, and go to step S6;
(2) if the object number of the class cluster does not exceed lowestSize, go to step S6.
S6. Reduce the value of similar: the updated value of similar equals the value before the update minus the similarity reduction step temp. Then judge whether the updated value of similar exceeds the minimum merge similarity lowestSimi:
(1) if the updated value of similar exceeds lowestSimi, go to step S3;
(2) if the updated value of similar does not exceed lowestSimi, go to step S7.
S7. Clustering ends; the remaining data frames that could not be merged are collected and added to leftDataSet, the set of data objects that did not form a satisfactory cluster.
The step S3 includes the following sub-steps:
S31. Compare class cluster i with every class cluster of the current data set DataSet other than itself, and find the class cluster j in the current data set DataSet with the highest similarity to class cluster i; here class cluster i is the i-th class cluster in the initial data set DataSet.
S32. Judge whether the similarity of class cluster i and class cluster j exceeds the current value of similar:
(1) if the similarity of class cluster i and class cluster j exceeds the current value of similar, merge class cluster i with class cluster j;
(2) if the similarity of class cluster i and class cluster j does not exceed the current value of similar, go to step S4.
S33. Delete the j-th class cluster from the data set DataSet, update the data set DataSet, and go to step S31.
The step S31 includes the following sub-steps:
S311. Compute the similarity of class cluster i to each class cluster in the current data set DataSet other than itself.
S312. Find the class cluster j in the current data set with the highest similarity to class cluster i.
The similarity between two class clusters is computed according to the following formula:
davg(ci, cj) = (1 / (ni × nj)) × Σ_{p ∈ ci, p′ ∈ cj} d(p, p′)
where davg(ci, cj) denotes the similarity between the two class clusters ci and cj, ni is the number of data frames contained in class cluster ci, nj is the number of data frames contained in class cluster cj, and d(p, p′) is the similarity between data frame p and data frame p′.
There are two methods for computing d(p, p′):
Method one: compute the similarity between the data frames directly:
d(p, p′) = sam(p, p′) / sum(p, p′),
where sam(p, p′) is obtained by the following operation: left-align data frames p and p′ and, taking a nibble as the unit, compare the aligned characters of p and p′ from left to right; the number of positions at which the aligned characters are identical is sam(p, p′), and sum(p, p′) is the number of comparisons made when computing sam(p, p′).
Method two: treat each data frame as a character string; the similarity similar(p, p′) between the two strings is the required d(p, p′):
d(p, p′) = similar(p, p′) = 1 − Distance(p, p′) / max(length(p), length(p′))
where length(p) and length(p′) are the lengths of strings p and p′ respectively, and Distance(p, p′) is the edit distance between the two strings, i.e. the number of insertion, substitution, and deletion operations required to transform string p into string p′.
The present invention was tested using the tcpdump experimental data set released by the Lincoln Laboratory; binary data frames of 9 of its protocols were extracted and treated as unknown protocols for the experiments. They are the dns, http, ntp, rip, smtp, ssh, arp, llc, and loop protocols.
The binary data streams were first converted to hexadecimal format. For the link-layer protocols arp, llc, and loop, the first x bytes (x = 68) of each data frame were taken; for the remaining protocols, the first x bytes (x = 68) were taken after removing the ip header and the tcp or udp header (so that the characteristics of the protocol itself are better presented); frames shorter than x bytes were taken in full. The value of x is chosen empirically: it should preferably cover all the characteristic information of a data frame, but should not be too large, or it would include a large amount of payload data, reducing the accuracy of the result and increasing the computational load.
300 frames of each of the above 9 protocols were selected at random, 2700 data frames in total, to form the input data; the frames were numbered 0 to 2699 in input order.
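As a sketch, the preprocessing just described might look like this in Python. Here header_len stands in for the length of the ip plus tcp/udp headers to strip, which varies by protocol and is an assumption for illustration.

```python
X_BYTES = 68  # x = 68, chosen empirically as described above

def preprocess(frame, header_len=0):
    """Strip an assumed header of header_len bytes, keep at most the
    first 68 bytes, and return the result in hexadecimal format."""
    payload = frame[header_len:]
    return payload[:X_BYTES].hex()  # frames shorter than x bytes kept in full
```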
Embodiment one: the parameters were set to minimum merge similarity lowestSimi = 0.1, minimum class cluster object number lowestSize = 200, and similarity reduction step temp = 0.1. The protocol classification method of the present invention based on the improved AGNES algorithm was applied, using method one in step S312; the experimental results obtained are shown in Table 1-1 below:
Table 1-1 Clustering experimental results (method one)
162 data frames remained in class clusters whose mutual similarity was below 0.1. Among these 162 remaining data frames, the class clusters that had been merged with similarity greater than 0.1 but did not meet the 200-object requirement, and whose data frame count is greater than or equal to 5, are shown in Table 1-2 below:
Table 1-2 Class clusters obtained in the experiment (method one)
Embodiment two: the parameter settings are identical to those of embodiment one, namely minimum merge similarity lowestSimi = 0.1, minimum class cluster object number lowestSize = 200, and similarity reduction step temp = 0.1, but method two in step S312 was used; the experimental results obtained are shown in Table 2-1 below:
Table 2-1 Clustering experimental results (method two)
58 data frames remained in class clusters whose mutual similarity was below 0.1. Among these 58 remaining data frames, the class clusters that had been merged with similarity greater than 0.1 but did not meet the 200-object requirement, and whose data frame count is greater than or equal to 5, are shown in Table 2-2 below:
Table 2-2 Class clusters obtained in the experiment (method two)
Comparing embodiment one and embodiment two: the results of embodiment one show that, of 9 input classes, 8 were successfully clustered; the remaining 162 data frames either had similarity below 0.1 or did not reach the 200-object requirement. The overall clustering accuracy is (2700 − 314 − 162)/(2700 − 162) = 85.88%, a good clustering result. The results of embodiment two show an overall clustering accuracy of (2700 − 502 − 58)/(2700 − 58) = 81.00%, slightly below that of embodiment one; however, the similarity computation of method two can cluster more data frames, although it is more prone to erroneous clusters when handling data frames of relatively low similarity (for example, 0.2 to 0.3).
Embodiment three: the 9 protocols above were clustered using common clustering algorithms in the weka tool. StringToWordVector was first applied for data preprocessing, the cluster number parameter of each clustering algorithm was then set to 9, and classes-to-clusters evaluation was used for assessment. Each clustering algorithm was run 3 times with different random seeds and the average of the results was taken; the results obtained are shown in Tables 3-1, 3-2, and 3-3 below:
Table 3-1 Results of the SimpleKMeans clustering algorithm in weka
Table 3-2 Results of the sIB clustering algorithm in weka
Table 3-3 Results of the EM clustering algorithm in weka
Comparing embodiment one and embodiment three, it can be seen that, using the protocol classification method of the present invention based on the improved AGNES algorithm with method one in step S312, the overall accuracy is 20.43 percentage points higher than that of the SimpleKMeans algorithm, 11.48 percentage points higher than that of the sIB algorithm, and 13.08 percentage points higher than that of the EM algorithm.