CN104216920B - Data classification method based on cluster and Hungary Algorithm - Google Patents

Data classification method based on cluster and Hungary Algorithm

Info

Publication number
CN104216920B
CN104216920B
Authority
CN
China
Prior art keywords
classification
sample
class
samples
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310220527.0A
Other languages
Chinese (zh)
Other versions
CN104216920A (en
Inventor
Hu Yong (胡勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cheerbright Technologies Co Ltd
Original Assignee
Beijing Cheerbright Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cheerbright Technologies Co Ltd filed Critical Beijing Cheerbright Technologies Co Ltd
Priority to CN201310220527.0A priority Critical patent/CN104216920B/en
Publication of CN104216920A publication Critical patent/CN104216920A/en
Application granted granted Critical
Publication of CN104216920B publication Critical patent/CN104216920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising: reading an original sample set {X1, X2, ..., XN}; treating all samples in {X1, X2, ..., XN} as unlabeled, and clustering all samples in the original sample set once with a clustering method to obtain L+C classes; assigning the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes; placing each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, clustering again and iterating over the unlabeled samples with an objective function, so that each unlabeled sample is assigned to some class or treated as background noise. The method classifies data simply and accurately, and the classification results are accurate.

Description

Data classification method based on clustering and the Hungarian algorithm
Technical field
The invention belongs to the field of data classification technology, and in particular relates to a data classification method based on clustering and the Hungarian algorithm.
Background technology
When analyzing samples, the classes of some samples are often unknown, the samples of known classes may be few, and there may be background noise that belongs to no class.
For such problems, a pure classification algorithm cannot produce a reliable classifier: the resulting classifier may be badly biased, and it cannot separate out the unlabeled classes. A pure clustering algorithm, on the other hand, ignores the reference value of the labeled samples; moreover, clustering algorithms cannot handle the treatment of background noise either. Closer approaches are semi-supervised learning algorithms, of which there are currently two main kinds: the first learns from both labeled and unlabeled samples; the second learns from positive examples and unlabeled samples. The first requires every class to have labeled samples, which is very restrictive. The second is a binary classification algorithm over positive and negative examples, so it can handle neither the case where some classes are labeled and others are not, nor the presence of background noise.
Summary of the invention
To address the defects of the prior art, the present invention provides a data classification method based on clustering and the Hungarian algorithm that can classify data simply and accurately, with accurate classification results.
The technical solution adopted by the present invention is as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising the following steps:
S1: read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
The number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C;
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes;
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
Preferably, in S2, the clustering method is KMeans clustering or hierarchical clustering.
Preferably, in S4, the clustering method used when clustering again is KMeans clustering or hierarchical clustering.
Preferably, in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise is specifically:
iterating over the unlabeled samples with the objective function, and identifying background noise by whether the objective function reaches an extremum; classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
Preferably, the objective function is set as: inter-class dispersion * intra-class aggregation * discrimination rate.
Preferably, the inter-class dispersion is expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance.
Preferably, the intra-class aggregation is expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance.
Preferably, the expression for the discrimination rate is: number of classified samples / total number of samples.
The beneficial effects of the present invention are as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm that is suitable for the following situation: the classes of some samples are known; the labeled samples of the known classes need not be many; some samples may belong to unknown classes, i.e. classes with no labeled samples; and there may be background noise, i.e. noise points belonging to no class. The method classifies data simply and accurately, and the classification results are accurate.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the data classification method based on clustering and the Hungarian algorithm provided by the invention;
Fig. 2 shows the original sample set in embodiment two;
Fig. 3 shows the desired classification of the samples in embodiment two;
Fig. 4 shows the actual classification of the samples in embodiment two.
Detailed description of the embodiments
The present invention is described in detail below in conjunction with the accompanying drawings:
Embodiment one
S1: read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
The number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C.
For example, N=650, L=2, C=3, n=50;
The original sample set consists of 650 malicious-code samples, numbered X1, X2, ..., X650. Among these 650 malicious-code samples, the classes of the 50 samples numbered X1, X2, ..., X50 are known: these 50 samples belong to 2 classes, namely 25 hacker viruses and 25 macro viruses. The classes of the 600 samples numbered X51, X52, ..., X650 are unknown; an unknown malicious-code sample may be of any type, for example a script virus, trojan, worm, hacker virus, or macro virus, or of course it may belong to no type at all.
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
In this step, the clustering method can be a common clustering algorithm, such as KMeans or hierarchical clustering.
For the malicious-code sample set above, clustering the 650 samples yields 5 classes. It must be emphasized that this step produces 5 classes but does not distinguish class names; that is, it does not determine which class is the hacker viruses and which class is the macro viruses. It simply gathers malicious-code samples of the same kind together.
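As an illustration of the first clustering in S2 (not part of the patent text), a minimal KMeans-style pass could be sketched in Python as follows; the sample points, the fixed initial centers, and the `kmeans` helper are all hypothetical:

```python
# Minimal KMeans sketch for step S2: cluster all samples with no labels.
# The 2-D data points and the initial centers below are hypothetical.

def kmeans(points, centers, iters=20):
    """Plain Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(d.index(min(d)))
    return labels, centers

# Two obvious groups of samples; here L + C = 2 clusters.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(samples, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)  # first three samples share one cluster, last three the other
```

In practice any common clustering implementation (library KMeans or hierarchical clustering) would be used in place of this toy loop, as the text notes.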
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes.
For example, since the classes of the 50 malicious-code samples X1, X2, ..., X50 are known to include hacker viruses and macro viruses, the assignment takes the class among the 5 that contains the most hacker viruses to be the hacker-virus class, and the class among the 5 that contains the most macro viruses to be the macro-virus class.
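A sketch of the S3 matching under hypothetical overlap counts. The patent specifies the Hungarian algorithm; for small L, the brute-force enumeration below finds the same optimal one-to-one assignment that the Hungarian algorithm computes in polynomial time (a production implementation would use a solver such as `scipy.optimize.linear_sum_assignment`):

```python
# Sketch of step S3: match the L known classes to L of the L + C clusters
# so that the total number of agreeing labeled samples is maximized.
# For small L, brute force over permutations yields the same optimal
# assignment the Hungarian algorithm finds in polynomial time.
from itertools import permutations

def assign_classes(overlap):
    """overlap[k][c] = number of labeled samples of known class k that the
    first clustering placed in cluster c (hypothetical counts)."""
    n_classes, n_clusters = len(overlap), len(overlap[0])
    best_score, best = -1, None
    for perm in permutations(range(n_clusters), n_classes):
        score = sum(overlap[k][perm[k]] for k in range(n_classes))
        if score > best_score:
            best_score, best = score, perm
    return dict(enumerate(best))  # known class index -> cluster index

# Hypothetical counts for L = 2 known classes over L + C = 5 clusters:
# row 0: hacker viruses, row 1: macro viruses.
overlap = [
    [18, 2, 3, 1, 1],  # most hacker viruses fell into cluster 0
    [1, 20, 2, 1, 1],  # most macro viruses fell into cluster 1
]
print(assign_classes(overlap))  # {0: 0, 1: 1}
```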
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with the objective function so that each unlabeled sample is assigned to some class or treated as background noise.
For example, suppose that among the 50 malicious-code samples of known class there are 20 hacker viruses, X1, X2, ..., X20, and that in the first clustering the 15 hacker viruses X1, X2, ..., X15 were gathered into the hacker-virus class while the 5 hacker viruses X16, X17, ..., X20 may have been gathered into other virus classes. The known hacker viruses X16, X17, ..., X20 must therefore be placed into their home class after the assignment.
After re-clustering with a common clustering algorithm such as KMeans or hierarchical clustering, the unlabeled samples are iterated with the objective function, and background noise is identified by whether the objective function reaches an extremum. Classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
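The shape of the S4 loop can be sketched as follows. This is an illustration, not the patent's exact procedure: the distance threshold used here to declare a sample background noise (label 0) is a hypothetical stand-in for the objective-function extremum test described in the text, and the sample coordinates are invented. What the sketch does preserve is the constraint that labeled samples never move, and the termination rule that the loop stops when the current assignment equals the previous one:

```python
# Sketch of the S4 iteration: labeled samples keep their class; unlabeled
# samples are reassigned each round; the loop stops when the assignment
# no longer changes. noise_dist is a hypothetical stand-in for the
# objective-function test that identifies background noise (label 0).

def iterate_unlabeled(samples, labels, noise_dist=2.0, max_rounds=50):
    """labels[i] > 0 is a fixed known class; labels[i] is None when
    sample i is unlabeled. Returns final labels, 0 = background noise."""
    fixed = [l is not None for l in labels]
    labels = [l if l is not None else 0 for l in labels]
    for _ in range(max_rounds):
        # Class centers from the current assignment (noise excluded).
        centers = {}
        for cls in set(l for l in labels if l > 0):
            members = [s for s, l in zip(samples, labels) if l == cls]
            centers[cls] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = list(labels)
        for i, s in enumerate(samples):
            if fixed[i]:
                continue
            dists = {c: sum((a - b) ** 2 for a, b in zip(s, ctr)) ** 0.5
                     for c, ctr in centers.items()}
            cls, d = min(dists.items(), key=lambda kv: kv[1])
            new_labels[i] = cls if d <= noise_dist else 0
        if new_labels == labels:  # current result equals last result: stop
            break
        labels = new_labels
    return labels

samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1), (20.0, 20.0)]
labels = [1, None, 2, None, None]  # two known samples, three unlabeled
print(iterate_unlabeled(samples, labels))  # [1, 1, 2, 2, 0]
```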
Specifically, the objective function can be set as: inter-class dispersion * intra-class aggregation * discrimination rate. These three parameters are introduced in turn below:
(1) Inter-class dispersion
Depending on the concrete application, the inter-class dispersion may be expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance.
It can be set as: (average distance between samples of different classes) / (average distance between all classified samples) = ((total distance between all classified samples - total distance between samples within each class) / (total distance between all classified samples)) * (number of classified samples * (number of classified samples - 1)) / Σ (number of samples of a class * (number of samples of that class - 1)).
For example, to define the average distance between samples of different classes, suppose there are five classes: hacker virus, macro virus, script virus, trojan, and worm. Let Xi and Xj denote samples and dij the distance between Xi and Xj. Let Yi denote the number of the class to which Xi is assigned, and Yj the number of the class to which Xj is assigned; Yi = 0 means that Xi is assigned to no class, and when Xi is assigned to some class, Yi takes a value in 1 to (L+C). Then:
The average distance between all classified samples is: the average of dij over all i, j with Yi > 0 and Yj > 0;
The total distance between all classified samples is: the sum of dij over all i, j with Yi > 0 and Yj > 0;
The number of classified samples is: the number of all i with Yi > 0;
The total distance between samples within each class is: the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
The average distance between samples of different classes is: the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj.
(2) Intra-class aggregation
The intra-class aggregation can be expressed by the intra-class similarity, or by a negative-exponent form of distance; it can also be expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance.
(3) Discrimination rate
The simplest form of the discrimination rate is: number of classified samples / total number of samples, or a function of this ratio.
For example, for the malicious-code sample set above, the total number of samples is 650. If all 650 malicious-code samples carry class labels, the discrimination rate is 100%; if 10 malicious-code samples are considered to belong to no class, the discrimination rate is (650-10)/650.
Properties: the larger the inter-class dispersion, the better; the larger the intra-class aggregation, the better; and the higher the discrimination rate, the better.
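The three factors can be combined in code. The sketch below makes one hedged choice per factor: dispersion as (average distance between samples of different classes) / (average distance between all classified samples), aggregation as the negative-exponent form exp(-average intra-class distance) mentioned in the text, and the discrimination rate as classified samples / total samples; the sample coordinates are invented for illustration:

```python
# Illustrative objective function:
#   inter-class dispersion * intra-class aggregation * discrimination rate
# Dispersion: avg different-class distance / avg distance over classified
# samples. Aggregation: exp(-avg intra-class distance) (negative-exponent
# form). Discrimination: classified / total. Y[i] = 0 means sample i is
# unclassified (background noise).
from math import exp, dist

def objective(samples, Y):
    pairs = [(i, j) for i in range(len(samples)) for j in range(len(samples))
             if i < j and Y[i] > 0 and Y[j] > 0]
    cross = [dist(samples[i], samples[j]) for i, j in pairs if Y[i] != Y[j]]
    within = [dist(samples[i], samples[j]) for i, j in pairs if Y[i] == Y[j]]
    all_d = cross + within
    dispersion = (sum(cross) / len(cross)) / (sum(all_d) / len(all_d))
    aggregation = exp(-sum(within) / len(within))
    discrimination = sum(1 for y in Y if y > 0) / len(Y)
    return dispersion * aggregation * discrimination

samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
tight = objective(samples, [1, 1, 2, 2, 0])  # compact, well-separated classes
loose = objective(samples, [1, 2, 1, 2, 0])  # classes mixed together
print(tight > loose)  # True: the tighter partition scores higher
```

This matches the stated properties: a partition with larger inter-class dispersion, larger intra-class aggregation, and a higher discrimination rate yields a larger objective value.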
Embodiment two
A concrete application of the data classification method based on clustering and the Hungarian algorithm provided by embodiment one of the present invention is described below.
Fig. 2 shows the original sample set, comprising 1000 samples and 100 noise points in total. The original sample set roughly divides into 4 classes, whose shape resembles the Chinese character "六" (six). The "left-falling stroke" and "right-falling stroke" below the character each have 10 labeled samples, shown as the samples pointed to by A and B in Fig. 2; the samples pointed to by A represent one class, and the samples pointed to by B represent another class. The "dot" and "horizontal stroke" above are unlabeled samples.
Fig. 3 shows the desired result of the cluster classification: 4 classes should be separated out, i.e. the samples enclosed by the circles and ellipses in Fig. 3, while the remaining samples scattered outside the circles and ellipses are treated as background-noise samples and are not classified.
Applying the data classification method based on clustering and the Hungarian algorithm provided by the invention yields the classification result shown in Fig. 4. As can be seen from Fig. 4, the invention essentially identifies the four stroke parts of "六" and filters out the background-noise samples.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention.

Claims (1)

1. A data classification method based on clustering and the Hungarian algorithm, characterized by comprising the following steps:
S1: read the original sample set {X1, X2, ..., XN};
the original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
the number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C;
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes;
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise;
in S2, the clustering method is KMeans clustering or hierarchical clustering;
in S4, the clustering method used when clustering again is KMeans clustering or hierarchical clustering;
in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise is specifically:
iterating over the unlabeled samples with the objective function, and identifying background noise by whether the objective function reaches an extremum; classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes;
the objective function is set as: inter-class dispersion * intra-class aggregation * discrimination rate;
the inter-class dispersion is expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance;
it can be set as: (average distance between samples of different classes) / (average distance between all classified samples) = ((total distance between all classified samples - total distance between samples within each class) / (total distance between all classified samples)) * (number of classified samples * (number of classified samples - 1)) / Σ (number of samples of a class * (number of samples of that class - 1));
the average distance between all classified samples is: the average of dij over all i, j with Yi > 0 and Yj > 0;
the total distance between all classified samples is: the sum of dij over all i, j with Yi > 0 and Yj > 0;
the number of classified samples is: the number of all i with Yi > 0;
the total distance between samples within each class is: the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
the average distance between samples of different classes is: the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj;
the intra-class aggregation is expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance;
the expression for the discrimination rate is: number of classified samples / total number of samples.
CN201310220527.0A 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm Active CN104216920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Publications (2)

Publication Number Publication Date
CN104216920A CN104216920A (en) 2014-12-17
CN104216920B true CN104216920B (en) 2017-11-21

Family

ID=52098417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310220527.0A Active CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Country Status (1)

Country Link
CN (1) CN104216920B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452770B2 (en) * 2010-07-15 2013-05-28 Xerox Corporation Constrained nonnegative tensor factorization for clustering
US8832655B2 (en) * 2011-09-29 2014-09-09 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An Yajing, "Research and Application of Intelligent Dimensionality Reduction Technology", China Master's Theses Full-text Database, Information Science and Technology, No. 07, 2012-07-15, abstract page and pp. 5-12 *

Also Published As

Publication number Publication date
CN104216920A (en) 2014-12-17

Similar Documents

Publication Publication Date Title
RU2018142757A (en) SYSTEM AND METHOD FOR DETECTING PLANT DISEASES
Carmichael et al. Shape-based recognition of wiry objects
Yang et al. Rapid detection of rice disease using microscopy image identification based on the synergistic judgment of texture and shape features and decision tree–confusion matrix method
CN109409400A (en) Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass
Ross et al. Exploiting the “doddington zoo” effect in biometric fusion
CN107066951B (en) Face spontaneous expression recognition method and system
CN107145778B (en) Intrusion detection method and device
CN101923652A (en) Pornographic picture identification method based on joint detection of skin colors and featured body parts
JP2017224278A (en) Method of letting rejector learn by constituting classification tree utilizing training image and detecting object on test image utilizing rejector
CN106506528A (en) A kind of Network Safety Analysis system under big data environment
CN102129574B (en) A kind of face authentication method and system
CN104091178A (en) Method for training human body sensing classifier based on HOG features
Babu et al. Handwritten digit recognition using structural, statistical features and k-nearest neighbor classifier
CN110009005A (en) A kind of net flow assorted method based on feature strong correlation
Rafea et al. Classification of a COVID-19 dataset by using labels created from clustering algorithms
Mörzinger et al. Visual Structure Analysis of Flow Charts in Patent Images.
Le et al. Document retrieval based on logo spotting using key-point matching
CN104216920B (en) Data classification method based on cluster and Hungary Algorithm
Gattal et al. Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window
CN110032973A (en) A kind of unsupervised helminth classification method and system based on artificial intelligence
Xu et al. Scene text detection based on robust stroke width transform and deep belief network
US9811726B2 (en) Chinese, Japanese, or Korean language detection
CN101840510B (en) Adaptive enhancement face authentication method based on cost sensitivity
JP7341962B2 (en) Learning data collection device, learning device, learning data collection method and program
Ray Extracting region of interest for palm print authentication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant