CN104216920A - Data classification method based on clustering and Hungary algorithm - Google Patents

Data classification method based on clustering and Hungary algorithm

Info

Publication number
CN104216920A
Authority
CN
China
Prior art keywords
classification
class
samples
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310220527.0A
Other languages
Chinese (zh)
Other versions
CN104216920B (en)
Inventor
胡勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cheerbright Technologies Co Ltd
Original Assignee
Beijing Cheerbright Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cheerbright Technologies Co Ltd filed Critical Beijing Cheerbright Technologies Co Ltd
Priority to CN201310220527.0A priority Critical patent/CN104216920B/en
Publication of CN104216920A publication Critical patent/CN104216920A/en
Application granted granted Critical
Publication of CN104216920B publication Critical patent/CN104216920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data classification method based on clustering and the Hungarian algorithm. The method comprises the following steps: the original sample set {X1, X2, ..., XN} is read; all samples in the original sample set {X1, X2, ..., XN} are treated as unclassified samples and are clustered for the first time with a clustering method, yielding L+C classes; the L known classes are assigned to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes; each sample in the known-classification sample subset {X1, X2, ..., Xn} is moved into the class it belongs to, the classes of the samples in the known-classification sample subset {X1, X2, ..., Xn} are then kept unchanged, clustering is performed again, and the unlabeled samples are iterated with an objective function so that each unlabeled sample is assigned to some class or treated as background noise. With this data classification method based on clustering and the Hungarian algorithm, data can be classified accurately and simply, and the classification result is accurate.

Description

Data classification method based on clustering and the Hungarian algorithm
Technical field
The invention belongs to the field of data classification technology, and specifically relates to a data classification method based on clustering and the Hungarian algorithm.
Background technology
When analyzing samples, it is often the case that the classes of only some of the samples are known, the number of samples with known classes is small, and there may be background noise that does not belong to any class.
For such problems, a classification algorithm cannot produce a reliable classifier (that is, the resulting classifier may have a large bias), and it cannot separate out the unlabeled classes; a clustering algorithm, on the other hand, ignores the reference value of the labeled samples, and clustering algorithms also cannot handle the classification of background noise. The closest existing approaches are semi-supervised learning algorithms, of which there are currently two main kinds: the first learns from labeled and unlabeled samples, and the second learns from positive examples and unlabeled samples. The first requires every class to be labeled to already have labeled samples, which is quite restrictive. The second is a binary classification algorithm over positive and negative examples, so it cannot handle the situation where some classes are labeled and others are not, nor the situation where background noise is present.
Summary of the invention
To address the defects of the prior art, the invention provides a data classification method based on clustering and the Hungarian algorithm that can classify data accurately and simply, with accurate classification results.
The technical solution adopted by the invention is as follows:
The invention provides a data classification method based on clustering and the Hungarian algorithm, comprising the following steps:
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C;
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes;
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
Preferably, in S2, the clustering method is the KMeans clustering method or a hierarchical clustering method.
Preferably, in S4, the clustering method used when clustering again is the KMeans clustering method or a hierarchical clustering method.
Preferably, in S4, iterating the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise specifically comprises:
iterating the unlabeled samples with the objective function, and identifying background noise according to whether the objective function reaches an extremum; when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
Preferably, the objective function is set as: between-class dispersion × within-class aggregation × discrimination rate.
Preferably, the between-class dispersion is represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
Preferably, the within-class aggregation is represented by the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
Preferably, the discrimination rate is expressed as: number of classified samples / total number of samples.
The beneficial effects of the invention are as follows:
The invention provides a data classification method based on clustering and the Hungarian algorithm that is suitable for the following situation: the classes of some samples are known; the number of sample points with known classes need not be large; some classes may be entirely unknown, i.e. sample points without any label; and there may be background noise, i.e. noise points that do not belong to any class. The method can classify data accurately and simply, and the classification result is accurate.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the data classification method based on clustering and the Hungarian algorithm provided by the invention;
Fig. 2 shows the original sample set in Embodiment 2;
Fig. 3 shows the expected classification of the samples in Embodiment 2;
Fig. 4 shows the actual classification of the samples in Embodiment 2.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings.
Embodiment 1
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C.
For example, N=650, L=2, C=3, n=50;
The original sample set consists of 650 malicious code samples, numbered X1, X2, ..., X650. Among these 650 malicious code samples, the classes of the 50 samples numbered X1, X2, ..., X50 are known: these 50 samples belong to 2 classes, namely 25 hacker viruses and 25 macro viruses. The classes of the 600 samples numbered X51, X52, ..., X650 are unknown; an unknown malicious code sample may be of any type, for example a script virus, a Trojan, a worm, a hacker virus or a macro virus, or it may belong to no type at all. The sketch below encodes this label layout.
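For concreteness, this label layout could be encoded as follows. This is a hypothetical NumPy setup; the split of the 50 labeled samples into the first 25 (hacker viruses, class 1) and the next 25 (macro viruses, class 2) is assumed purely for illustration:

    import numpy as np

    # Hypothetical encoding of the example: 650 samples, the first 50 labeled.
    N, n, L, C = 650, 50, 2, 3
    y_fixed = np.zeros(N, dtype=int)   # 0 = no class assigned yet
    y_fixed[:25] = 1                   # hacker viruses (known class 1)
    y_fixed[25:50] = 2                 # macro viruses (known class 2)
    unlabeled_idx = np.arange(n, N)    # X51 .. X650 have unknown classes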
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
In this step, a common clustering algorithm such as KMeans or hierarchical clustering can be used.
For the malicious code sample set above, clustering the 650 samples yields 5 classes in total. It should be emphasized that in this step the clustering only produces 5 classes without distinguishing class names; that is, it does not determine which class is the hacker viruses and which is the macro viruses. It merely gathers malicious code samples of the same kind together, as the sketch below illustrates.
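The following Python sketch performs the initial clustering of step S2, assuming the 650 samples have already been converted into a numeric feature matrix X and using scikit-learn's KMeans (one of the clustering methods named above); it is an illustration, not the patent's prescribed implementation:

    from sklearn.cluster import KMeans

    def initial_clustering(X, L, C, random_state=0):
        # Step S2: ignore all labels and cluster every sample into L + C groups.
        km = KMeans(n_clusters=L + C, n_init=10, random_state=random_state)
        cluster_ids = km.fit_predict(X)   # one cluster index (0 .. L+C-1) per sample
        return cluster_ids

    # e.g. cluster_ids = initial_clustering(X, L=2, C=3) gives 5 unnamed clusters.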
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes.
For example, since the classes of the 50 malicious code samples X1, X2, ..., X50 are known to be hacker viruses and macro viruses, the assignment treats the one of the 5 classes that contains the most hacker viruses as the hacker virus class, and the one that contains the most macro viruses as the macro virus class, as sketched below.
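A minimal sketch of this assignment uses SciPy's linear_sum_assignment, an implementation of the Hungarian algorithm, on the overlap counts between known classes and clusters; the known classes are assumed to be numbered 1..L as in the earlier setup sketch, and the helper name is illustrative:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_known_classes(y_known, clusters_of_known, L, n_clusters):
        # overlap[k-1, c] = how many labeled samples of known class k landed in cluster c.
        overlap = np.zeros((L, n_clusters), dtype=int)
        for k, c in zip(y_known, clusters_of_known):
            overlap[k - 1, c] += 1
        # The Hungarian algorithm minimizes total cost, so negate the overlaps
        # to get the class-to-cluster assignment with maximum total agreement.
        rows, cols = linear_sum_assignment(-overlap)
        return {k + 1: c for k, c in zip(rows, cols)}   # known class -> cluster index

Maximizing the total agreement over all L classes at once is what distinguishes the Hungarian assignment from matching each known class to its best cluster greedily, which could assign two classes to the same cluster.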
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
For example, suppose that among the 50 malicious code samples with known classes, X1, X2, ..., X20 are 20 hacker viruses, and that in the initial clustering the 15 hacker viruses X1, X2, ..., X15 were gathered into the hacker virus class while the 5 hacker viruses X16, X17, ..., X20 may have been gathered into some other virus class. After the assignment, the known hacker viruses X16, X17, ..., X20 therefore need to be moved into the class they belong to, as in the short continuation below.
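Continuing the earlier sketches (reusing the illustrative names y_fixed, cluster_ids and match_known_classes), the first part of step S4 can be expressed as forcing every labeled sample into the cluster that now carries its known class:

    # Force every labeled sample (e.g. X16..X20 in the example) into the cluster
    # that the Hungarian assignment matched to its known class.
    cluster_for_class = match_known_classes(y_fixed[:n], cluster_ids[:n], L, L + C)
    for i in range(n):
        cluster_ids[i] = cluster_for_class[y_fixed[i]]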
After clustering again with a common clustering algorithm such as KMeans or hierarchical clustering, the unlabeled samples are iterated with an objective function, and background noise is identified according to whether the objective function reaches an extremum. When the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
Specifically, the objective function can be set as: between-class dispersion × within-class aggregation × discrimination rate. The three factors are introduced in turn below:
(1) Between-class dispersion
Depending on the concrete application, the between-class dispersion can be represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
For example, it can be set as:
between-class dispersion = (mean distance between samples of different classes) / (mean distance between all classified samples)
= [(total distance between all classified samples − total distance between samples within the same class) × M × (M − 1)] / [(total distance between all classified samples) × (M × (M − 1) − Σ over all classes of mk × (mk − 1))],
where M is the number of classified samples and mk is the number of samples in class k.
For example, suppose there are five classes in total: hacker viruses, macro viruses, script viruses, Trojans and worms. Let Xi and Xj denote samples, let dij denote the distance between Xi and Xj, and let Yi denote the number of the class to which Xi is assigned (likewise Yj for Xj); Yi = 0 means that Xi is not assigned to any class, and when Xi is assigned to some class, Yi takes a value between 1 and (L+C). Then:
The mean distance between all classified samples is the mean of dij over all i, j with Yi > 0 and Yj > 0;
The total distance between all classified samples is the sum of dij over all i, j with Yi > 0 and Yj > 0;
The number of classified samples is the number of all i with Yi > 0;
The total distance between samples within the same class is the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
The mean distance between samples of different classes is the mean of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj.
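As an illustration, the mean-distance form of the between-class dispersion can be computed with the following NumPy sketch, assuming a precomputed pairwise distance matrix D (D[i, j] = dij) and a class vector y that follows the Yi convention above (0 = assigned to no class):

    import numpy as np

    def between_class_dispersion(D, y):
        # D: (N, N) symmetric matrix of pairwise distances dij.
        # y: length-N class vector; y[i] = 0 means sample i is assigned to no class.
        labeled = y > 0
        pair = labeled[:, None] & labeled[None, :]
        np.fill_diagonal(pair, False)                      # drop the i == j pairs
        diff = pair & (y[:, None] != y[None, :])           # pairs from different classes
        # Ratio of the two mean distances defined in the text above.
        return D[diff].mean() / D[pair].mean()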
(2) Within-class aggregation
The within-class aggregation can be expressed with a similarity, or with a negative-exponential form of the distance; it can be represented in terms of the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
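A matching sketch for the within-class aggregation, here using a negative-exponential form of the within-class mean distance; choosing this particular form, out of the representations listed above, is an assumption for illustration:

    def within_class_aggregation(D, y):
        labeled = y > 0
        same = (y[:, None] == y[None, :]) & labeled[:, None] & labeled[None, :]
        np.fill_diagonal(same, False)
        # Negative-exponential form of the within-class mean distance
        # (the exact functional form is assumed for this sketch).
        return np.exp(-D[same].mean())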
(3) Discrimination rate
The simplest form of the discrimination rate is: number of classified samples / total number of samples; it can also be a function of this ratio.
For example, for the malicious code sample set above, the total number of samples is 650. If all 650 malicious code samples receive a class label, the discrimination rate is 100%; if 10 malicious code samples are considered to belong to no class, the discrimination rate is (650 − 10)/650.
Properties: the larger the between-class dispersion, the better; the larger the within-class aggregation, the better; and the higher the discrimination rate, the better. A sketch that combines the three factors and iterates the unlabeled samples follows.
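Putting the three factors together, the sketch below (reusing between_class_dispersion and within_class_aggregation from the sketches above, as well as y_fixed and unlabeled_idx from the example setup) shows one possible way to carry out the iteration of step S4. The greedy per-sample re-assignment and the convergence test are assumptions, since the patent only requires that the iteration stop when the result or the objective function no longer changes:

    def discrimination_rate(y):
        # Number of classified samples divided by the total number of samples.
        return np.mean(y > 0)

    def objective(D, y):
        # Between-class dispersion x within-class aggregation x discrimination rate.
        return (between_class_dispersion(D, y)
                * within_class_aggregation(D, y)
                * discrimination_rate(y))

    def iterate_unlabeled(D, y_fixed, unlabeled_idx, n_classes, max_iter=50):
        # Step S4: labels of the known-classification samples stay fixed; each
        # unlabeled sample is repeatedly re-assigned to the choice (some class,
        # or 0 = background noise) that maximizes the objective, until the
        # assignment no longer changes from one pass to the next.
        # (A fuller implementation would start from the result of the second
        # clustering rather than from all unlabeled samples marked as noise.)
        y = y_fixed.copy()
        for _ in range(max_iter):
            prev = y.copy()
            for i in unlabeled_idx:
                scores = []
                for c in range(n_classes + 1):       # c = 0 means background noise
                    y[i] = c
                    scores.append(objective(D, y))
                y[i] = int(np.argmax(scores))
            if np.array_equal(y, prev):              # results no longer change: stop
                break
        return y

    # e.g. labels = iterate_unlabeled(D, y_fixed, unlabeled_idx, n_classes=L + C)

Label 0 plays the role of background noise here: a sample keeps label 0 whenever assigning it to any class would lower the objective, which is how the discrimination-rate factor trades off against the other two.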
Embodiment 2
A concrete application of the data classification method based on clustering and the Hungarian algorithm provided in Embodiment 1 is described below.
Fig. 2 shows the original sample set, which comprises 1000 samples plus 100 noise points. The original sample set falls roughly into 4 classes whose shapes resemble the strokes of the Chinese character "六" (six). The left-falling stroke and the right-falling stroke in the lower part of "六" each have 10 labeled samples, indicated by A and B in Fig. 2; the samples indicated by A represent one class and the samples indicated by B represent another class. The dot and the horizontal stroke above are unlabeled samples.
Fig. 3 shows the expected classification result: 4 classes should be separated out, namely the samples enclosed by circles or ellipses in Fig. 3, while the remaining samples scattered outside the circles and ellipses are background noise samples and are left unclassified.
With the data classification method based on clustering and the Hungarian algorithm provided by the invention, the classification result shown in Fig. 4 is obtained. As can be seen from Fig. 4, the invention essentially identifies the four stroke parts of "六" and filters out the background noise samples.
The above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A data classification method based on clustering and the Hungarian algorithm, characterized in that it comprises the following steps:
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C;
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes;
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
2. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S2 the clustering method is the KMeans clustering method or a hierarchical clustering method.
3. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S4 the clustering method used when clustering again is the KMeans clustering method or a hierarchical clustering method.
4. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S4, iterating the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise specifically comprises:
iterating the unlabeled samples with the objective function, and identifying background noise according to whether the objective function reaches an extremum; when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
5. The data classification method based on clustering and the Hungarian algorithm according to claim 4, characterized in that the objective function is set as: between-class dispersion × within-class aggregation × discrimination rate.
6. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the between-class dispersion is represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
7. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the within-class aggregation is represented by the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
8. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the discrimination rate is expressed as: number of classified samples / total number of samples.
CN201310220527.0A 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm Active CN104216920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Publications (2)

Publication Number Publication Date
CN104216920A true CN104216920A (en) 2014-12-17
CN104216920B CN104216920B (en) 2017-11-21

Family

ID=52098417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310220527.0A Active CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Country Status (1)

Country Link
CN (1) CN104216920B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
US20120016878A1 (en) * 2010-07-15 2012-01-19 Xerox Corporation Constrained nonnegative tensor factorization for clustering
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network
US20130086553A1 (en) * 2011-09-29 2013-04-04 Mark Grechanik Systems and methods for finding project-related information by clustering applications into related concept categories

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
US20120016878A1 (en) * 2010-07-15 2012-01-19 Xerox Corporation Constrained nonnegative tensor factorization for clustering
US20130086553A1 (en) * 2011-09-29 2013-04-04 Mark Grechanik Systems and methods for finding project-related information by clustering applications into related concept categories
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
安亚静 (An Yajing): "Research and Application of Intelligent Dimensionality Reduction Technology" (智能降维技术的研究与应用), China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN104216920B (en) 2017-11-21

Similar Documents

Publication Publication Date Title
Silva et al. Evaluation of features for leaf discrimination
Lum et al. Extracting insights from the shape of complex data using topology
CN106920206B (en) Steganalysis method based on antagonistic neural network
CN109299741B (en) Network attack type identification method based on multi-layer detection
CN102147858B (en) License plate character identification method
Yue et al. Hashing based fast palmprint identification for large-scale databases
Huang et al. Using glowworm swarm optimization algorithm for clustering analysis
CN109241741B (en) Malicious code classification method based on image texture fingerprints
CN104040561B (en) Pass through the method for the regular identification microorganism of mass spectrometry and fraction
KR101780676B1 (en) Method for learning rejector by forming classification tree in use of training image and detecting object in test image by using the rejector
CN105404886A (en) Feature model generating method and feature model generating device
CN104504412A (en) Method and system for extracting and identifying handwriting stroke features
CN108985065A (en) The Calculate Mahalanobis Distance of application enhancements carries out the method and system of firmware Hole Detection
CN105279506A (en) Manchu script central axis positioning method
CN1959671A (en) Measure of similarity of documentation based on document structure
CN106844337A (en) A kind of contract lacks clause automatic scanning method and system
CN103310205B (en) A kind of Handwritten Numeral Recognition Method and device
CN107886130A (en) A kind of kNN rapid classification methods based on cluster and Similarity-Weighted
Gattal et al. Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window
CN106874762A (en) Android malicious code detecting method based on API dependence graphs
CN1141665C (en) Micro image characteristic extracting and recognizing method
CN102968622B (en) A kind of TV station symbol recognition method and TV station symbol recognition device
CN104216920A (en) Data classification method based on clustering and Hungary algorithm
Nanni et al. Ensemble to improve gesture recognition
Manimekalai et al. Taxonomic classification of Plant species using support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant