CN104216920A - Data classification method based on clustering and Hungary algorithm - Google Patents

Data classification method based on clustering and Hungary algorithm

Info

Publication number
CN104216920A
Authority
CN
China
Prior art keywords
classification
class
samples
cluster
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310220527.0A
Other languages
Chinese (zh)
Other versions
CN104216920B (en)
Inventor
胡勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cheerbright Technologies Co Ltd
Original Assignee
Beijing Cheerbright Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cheerbright Technologies Co Ltd filed Critical Beijing Cheerbright Technologies Co Ltd
Priority to CN201310220527.0A priority Critical patent/CN104216920B/en
Publication of CN104216920A publication Critical patent/CN104216920A/en
Application granted granted Critical
Publication of CN104216920B publication Critical patent/CN104216920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data classification method based on clustering and the Hungarian algorithm. The method comprises the following steps: the original sample set {X1, X2, ..., XN} is read; all samples in the original sample set {X1, X2, ..., XN} are treated as unclassified samples and are clustered for the first time with a clustering method, yielding L+C classes; the L known classes are assigned to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes; each sample in the known-classification sample subset {X1, X2, ..., Xn} is moved into the class it belongs to, the classes of the samples in the known-classification sample subset {X1, X2, ..., Xn} are then kept unchanged, clustering is performed again, and the unlabeled samples are iterated with an objective function so that each unlabeled sample is assigned to some class or treated as background noise. With this data classification method based on clustering and the Hungarian algorithm, data can be classified accurately and simply, and the classification result is accurate.

Description

Data classification method based on clustering and the Hungarian algorithm
Technical field
The invention belongs to the field of data classification technology, and specifically relates to a data classification method based on clustering and the Hungarian algorithm.
Background technology
When analyzing samples, it is often the case that the classes of only some of the samples are known, the number of samples with known classes is small, and there may be background noise that does not belong to any class.
For such problems, a classification algorithm cannot produce a reliable classifier (that is, the resulting classifier may have a large bias), and it cannot separate out the unlabeled classes; a clustering algorithm, on the other hand, ignores the reference value of the labeled samples, and clustering algorithms also cannot handle the classification of background noise. The closest existing approaches are semi-supervised learning algorithms, of which there are currently two main kinds: the first learns from labeled and unlabeled samples, and the second learns from positive examples and unlabeled samples. The first requires every class to be labeled to already have labeled samples, which is quite restrictive. The second is a binary classification algorithm over positive and negative examples, so it cannot handle the situation where some classes are labeled and others are not, nor the situation where background noise is present.
Summary of the invention
To address the defects of the prior art, the invention provides a data classification method based on clustering and the Hungarian algorithm that can classify data accurately and simply, with accurate classification results.
The technical solution adopted by the invention is as follows:
The invention provides a data classification method based on clustering and the Hungarian algorithm, comprising the following steps:
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C;
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes;
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
Preferably, in S2, the clustering method is the KMeans clustering method or a hierarchical clustering method.
Preferably, in S4, the clustering method used when clustering again is the KMeans clustering method or a hierarchical clustering method.
Preferably, in S4, iterating the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise specifically comprises:
iterating the unlabeled samples with the objective function, and identifying background noise according to whether the objective function reaches an extremum; when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
Preferably, the objective function is set as: between-class dispersion × within-class aggregation × discrimination rate.
Preferably, the between-class dispersion is represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
Preferably, the within-class aggregation is represented by the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
Preferably, the discrimination rate is expressed as: number of classified samples / total number of samples.
The beneficial effects of the invention are as follows:
The invention provides a data classification method based on clustering and the Hungarian algorithm that is suitable for the following situation: the classes of some samples are known; the number of sample points with known classes need not be large; some classes may be entirely unknown, i.e. sample points without any label; and there may be background noise, i.e. noise points that do not belong to any class. The method can classify data accurately and simply, and the classification result is accurate.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the data classification method based on clustering and the Hungarian algorithm provided by the invention;
Fig. 2 shows the original sample set in Embodiment 2;
Fig. 3 shows the expected classification of the samples in Embodiment 2;
Fig. 4 shows the actual classification of the samples in Embodiment 2.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings.
Embodiment 1
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C.
For example, N=650, L=2, C=3, n=50;
The original sample set consists of 650 malicious code samples, numbered X1, X2, ..., X650. Among these 650 malicious code samples, the classes of the 50 samples numbered X1, X2, ..., X50 are known: these 50 samples belong to 2 classes, namely 25 hacker viruses and 25 macro viruses. The classes of the 600 samples numbered X51, X52, ..., X650 are unknown; an unknown malicious code sample may be of any type, for example a script virus, a Trojan, a worm, a hacker virus or a macro virus, or it may belong to no type at all. The sketch below encodes this label layout.
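For concreteness, this label layout could be encoded as follows. This is a hypothetical NumPy setup; the split of the 50 labeled samples into the first 25 (hacker viruses, class 1) and the next 25 (macro viruses, class 2) is assumed purely for illustration:

    import numpy as np

    # Hypothetical encoding of the example: 650 samples, the first 50 labeled.
    N, n, L, C = 650, 50, 2, 3
    y_fixed = np.zeros(N, dtype=int)   # 0 = no class assigned yet
    y_fixed[:25] = 1                   # hacker viruses (known class 1)
    y_fixed[25:50] = 2                 # macro viruses (known class 2)
    unlabeled_idx = np.arange(n, N)    # X51 .. X650 have unknown classes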
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
In this step, a common clustering algorithm such as KMeans or hierarchical clustering can be used.
For the malicious code sample set above, clustering the 650 samples yields 5 classes in total. It should be emphasized that in this step the clustering only produces 5 classes without distinguishing class names; that is, it does not determine which class is the hacker viruses and which is the macro viruses. It merely gathers malicious code samples of the same kind together, as the sketch below illustrates.
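The following Python sketch performs the initial clustering of step S2, assuming the 650 samples have already been converted into a numeric feature matrix X and using scikit-learn's KMeans (one of the clustering methods named above); it is an illustration, not the patent's prescribed implementation:

    from sklearn.cluster import KMeans

    def initial_clustering(X, L, C, random_state=0):
        # Step S2: ignore all labels and cluster every sample into L + C groups.
        km = KMeans(n_clusters=L + C, n_init=10, random_state=random_state)
        cluster_ids = km.fit_predict(X)   # one cluster index (0 .. L+C-1) per sample
        return cluster_ids

    # e.g. cluster_ids = initial_clustering(X, L=2, C=3) gives 5 unnamed clusters.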
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes.
For example, since the classes of the 50 malicious code samples X1, X2, ..., X50 are known to be hacker viruses and macro viruses, the assignment treats the one of the 5 classes that contains the most hacker viruses as the hacker virus class, and the one that contains the most macro viruses as the macro virus class, as sketched below.
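A minimal sketch of this assignment uses SciPy's linear_sum_assignment, an implementation of the Hungarian algorithm, on the overlap counts between known classes and clusters; the known classes are assumed to be numbered 1..L as in the earlier setup sketch, and the helper name is illustrative:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_known_classes(y_known, clusters_of_known, L, n_clusters):
        # overlap[k-1, c] = how many labeled samples of known class k landed in cluster c.
        overlap = np.zeros((L, n_clusters), dtype=int)
        for k, c in zip(y_known, clusters_of_known):
            overlap[k - 1, c] += 1
        # The Hungarian algorithm minimizes total cost, so negate the overlaps
        # to get the class-to-cluster assignment with maximum total agreement.
        rows, cols = linear_sum_assignment(-overlap)
        return {k + 1: c for k, c in zip(rows, cols)}   # known class -> cluster index

Maximizing the total agreement over all L classes at once is what distinguishes the Hungarian assignment from matching each known class to its best cluster greedily, which could assign two classes to the same cluster.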
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
For example, suppose that among the 50 malicious code samples with known classes, X1, X2, ..., X20 are 20 hacker viruses, and that in the initial clustering the 15 hacker viruses X1, X2, ..., X15 were gathered into the hacker virus class while the 5 hacker viruses X16, X17, ..., X20 may have been gathered into some other virus class. After the assignment, the known hacker viruses X16, X17, ..., X20 therefore need to be moved into the class they belong to, as in the short continuation below.
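Continuing the earlier sketches (reusing the illustrative names y_fixed, cluster_ids and match_known_classes), the first part of step S4 can be expressed as forcing every labeled sample into the cluster that now carries its known class:

    # Force every labeled sample (e.g. X16..X20 in the example) into the cluster
    # that the Hungarian assignment matched to its known class.
    cluster_for_class = match_known_classes(y_fixed[:n], cluster_ids[:n], L, L + C)
    for i in range(n):
        cluster_ids[i] = cluster_for_class[y_fixed[i]]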
After clustering again with a common clustering algorithm such as KMeans or hierarchical clustering, the unlabeled samples are iterated with an objective function, and background noise is identified according to whether the objective function reaches an extremum. When the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
Specifically, the objective function can be set as: between-class dispersion × within-class aggregation × discrimination rate. The three factors are introduced in turn below:
(1) Between-class dispersion
Depending on the concrete application, the between-class dispersion can be represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
For example, it can be set as:
between-class dispersion = (mean distance between samples of different classes) / (mean distance between all classified samples)
= [(total distance between all classified samples − total distance between samples within the same class) × M × (M − 1)] / [(total distance between all classified samples) × (M × (M − 1) − Σ over all classes of mk × (mk − 1))],
where M is the number of classified samples and mk is the number of samples in class k.
For example, suppose there are five classes in total: hacker viruses, macro viruses, script viruses, Trojans and worms. Let Xi and Xj denote samples, let dij denote the distance between Xi and Xj, and let Yi denote the number of the class to which Xi is assigned (likewise Yj for Xj); Yi = 0 means that Xi is not assigned to any class, and when Xi is assigned to some class, Yi takes a value between 1 and (L+C). Then:
The mean distance between all classified samples is the mean of dij over all i, j with Yi > 0 and Yj > 0;
The total distance between all classified samples is the sum of dij over all i, j with Yi > 0 and Yj > 0;
The number of classified samples is the number of all i with Yi > 0;
The total distance between samples within the same class is the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
The mean distance between samples of different classes is the mean of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj.
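As an illustration, the mean-distance form of the between-class dispersion can be computed with the following NumPy sketch, assuming a precomputed pairwise distance matrix D (D[i, j] = dij) and a class vector y that follows the Yi convention above (0 = assigned to no class):

    import numpy as np

    def between_class_dispersion(D, y):
        # D: (N, N) symmetric matrix of pairwise distances dij.
        # y: length-N class vector; y[i] = 0 means sample i is assigned to no class.
        labeled = y > 0
        pair = labeled[:, None] & labeled[None, :]
        np.fill_diagonal(pair, False)                      # drop the i == j pairs
        diff = pair & (y[:, None] != y[None, :])           # pairs from different classes
        # Ratio of the two mean distances defined in the text above.
        return D[diff].mean() / D[pair].mean()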
(2) Within-class aggregation
The within-class aggregation can be expressed with a similarity, or with a negative-exponential form of the distance; it can be represented in terms of the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
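A matching sketch for the within-class aggregation, here using a negative-exponential form of the within-class mean distance; choosing this particular form, out of the representations listed above, is an assumption for illustration:

    def within_class_aggregation(D, y):
        labeled = y > 0
        same = (y[:, None] == y[None, :]) & labeled[:, None] & labeled[None, :]
        np.fill_diagonal(same, False)
        # Negative-exponential form of the within-class mean distance
        # (the exact functional form is assumed for this sketch).
        return np.exp(-D[same].mean())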
(3) Discrimination rate
The simplest form of the discrimination rate is: number of classified samples / total number of samples; it can also be a function of this ratio.
For example, for the malicious code sample set above, the total number of samples is 650. If all 650 malicious code samples receive a class label, the discrimination rate is 100%; if 10 malicious code samples are considered to belong to no class, the discrimination rate is (650 − 10)/650.
Properties: the larger the between-class dispersion, the better; the larger the within-class aggregation, the better; and the higher the discrimination rate, the better. A sketch that combines the three factors and iterates the unlabeled samples follows.
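Putting the three factors together, the sketch below (reusing between_class_dispersion and within_class_aggregation from the sketches above, as well as y_fixed and unlabeled_idx from the example setup) shows one possible way to carry out the iteration of step S4. The greedy per-sample re-assignment and the convergence test are assumptions, since the patent only requires that the iteration stop when the result or the objective function no longer changes:

    def discrimination_rate(y):
        # Number of classified samples divided by the total number of samples.
        return np.mean(y > 0)

    def objective(D, y):
        # Between-class dispersion x within-class aggregation x discrimination rate.
        return (between_class_dispersion(D, y)
                * within_class_aggregation(D, y)
                * discrimination_rate(y))

    def iterate_unlabeled(D, y_fixed, unlabeled_idx, n_classes, max_iter=50):
        # Step S4: labels of the known-classification samples stay fixed; each
        # unlabeled sample is repeatedly re-assigned to the choice (some class,
        # or 0 = background noise) that maximizes the objective, until the
        # assignment no longer changes from one pass to the next.
        # (A fuller implementation would start from the result of the second
        # clustering rather than from all unlabeled samples marked as noise.)
        y = y_fixed.copy()
        for _ in range(max_iter):
            prev = y.copy()
            for i in unlabeled_idx:
                scores = []
                for c in range(n_classes + 1):       # c = 0 means background noise
                    y[i] = c
                    scores.append(objective(D, y))
                y[i] = int(np.argmax(scores))
            if np.array_equal(y, prev):              # results no longer change: stop
                break
        return y

    # e.g. labels = iterate_unlabeled(D, y_fixed, unlabeled_idx, n_classes=L + C)

Label 0 plays the role of background noise here: a sample keeps label 0 whenever assigning it to any class would lower the objective, which is how the discrimination-rate factor trades off against the other two.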
Embodiment 2
A concrete application of the data classification method based on clustering and the Hungarian algorithm provided in Embodiment 1 is described below.
Fig. 2 shows the original sample set, which comprises 1000 samples plus 100 noise points. The original sample set falls roughly into 4 classes whose shapes resemble the strokes of the Chinese character "六" (six). The left-falling stroke and the right-falling stroke in the lower part of "六" each have 10 labeled samples, indicated by A and B in Fig. 2; the samples indicated by A represent one class and the samples indicated by B represent another class. The dot and the horizontal stroke above are unlabeled samples.
Fig. 3 shows the expected classification result: 4 classes should be separated out, namely the samples enclosed by circles or ellipses in Fig. 3, while the remaining samples scattered outside the circles and ellipses are background noise samples and are left unclassified.
With the data classification method based on clustering and the Hungarian algorithm provided by the invention, the classification result shown in Fig. 4 is obtained. As can be seen from Fig. 4, the invention essentially identifies the four stroke parts of "六" and filters out the background noise samples.
The above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A data classification method based on clustering and the Hungarian algorithm, characterized in that it comprises the following steps:
S1, read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-classification sample subset {X1, X2, ..., Xn} and an unknown-classification sample subset {Xn+1, Xn+2, ..., XN}; wherein the class Yi of each sample in the known-classification sample subset {X1, X2, ..., Xn} is Y1, Y2, ..., Yn respectively, and the number of known classes in the known-classification sample subset is L;
The number of unknown classes in the unknown-classification sample subset {Xn+1, Xn+2, ..., XN} is C;
S2, treat all samples in the original sample set {X1, X2, ..., XN} as unclassified samples, and cluster all samples in the original sample set for the first time with a clustering method, obtaining L+C classes;
S3, assign the L known classes to L of the L+C classes with the Hungarian algorithm, so that the classes obtained by the initial clustering correspond to the known classes;
S4, move each sample in the known-classification sample subset {X1, X2, ..., Xn} into the class it belongs to, then keep the class of each sample in the known-classification sample subset {X1, X2, ..., Xn} unchanged, cluster again, and iterate the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
2. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S2 the clustering method is the KMeans clustering method or a hierarchical clustering method.
3. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S4 the clustering method used when clustering again is the KMeans clustering method or a hierarchical clustering method.
4. The data classification method based on clustering and the Hungarian algorithm according to claim 1, characterized in that in S4, iterating the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise specifically comprises:
iterating the unlabeled samples with the objective function, and identifying background noise according to whether the objective function reaches an extremum; when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes, the classification ends.
5. The data classification method based on clustering and the Hungarian algorithm according to claim 4, characterized in that the objective function is set as: between-class dispersion × within-class aggregation × discrimination rate.
6. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the between-class dispersion is represented by the between-class mean distance, the between-class mean squared distance, the between-class minimum distance, or the between-class maximum distance.
7. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the within-class aggregation is represented by the within-class mean distance, the within-class mean squared distance, or the within-class maximum distance.
8. The data classification method based on clustering and the Hungarian algorithm according to claim 5, characterized in that the discrimination rate is expressed as: number of classified samples / total number of samples.
CN201310220527.0A 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm Active CN104216920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Publications (2)

Publication Number Publication Date
CN104216920A true CN104216920A (en) 2014-12-17
CN104216920B CN104216920B (en) 2017-11-21

Family

ID=52098417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310220527.0A Active CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Country Status (1)

Country Link
CN (1) CN104216920B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
US20120016878A1 (en) * 2010-07-15 2012-01-19 Xerox Corporation Constrained nonnegative tensor factorization for clustering
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network
US20130086553A1 (en) * 2011-09-29 2013-04-04 Mark Grechanik Systems and methods for finding project-related information by clustering applications into related concept categories

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
US20120016878A1 (en) * 2010-07-15 2012-01-19 Xerox Corporation Constrained nonnegative tensor factorization for clustering
US20130086553A1 (en) * 2011-09-29 2013-04-04 Mark Grechanik Systems and methods for finding project-related information by clustering applications into related concept categories
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
安亚静 (An Yajing): "Research and Application of Intelligent Dimensionality Reduction Technology" (智能降维技术的研究与应用), China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN104216920B (en) 2017-11-21

Similar Documents

Publication Publication Date Title
Silva et al. Evaluation of features for leaf discrimination
Lum et al. Extracting insights from the shape of complex data using topology
CN106920206B (en) Steganalysis method based on antagonistic neural network
CN109299741B (en) Network attack type identification method based on multi-layer detection
CN102147858B (en) License plate character identification method
Yue et al. Hashing based fast palmprint identification for large-scale databases
Huang et al. Using glowworm swarm optimization algorithm for clustering analysis
CN109241741B (en) Malicious code classification method based on image texture fingerprints
CN104040561B (en) Pass through the method for the regular identification microorganism of mass spectrometry and fraction
KR101780676B1 (en) Method for learning rejector by forming classification tree in use of training image and detecting object in test image by using the rejector
CN105404886A (en) Feature model generating method and feature model generating device
CN104504412A (en) Method and system for extracting and identifying handwriting stroke features
CN108985065A (en) The Calculate Mahalanobis Distance of application enhancements carries out the method and system of firmware Hole Detection
CN105279506A (en) Manchu script central axis positioning method
CN1959671A (en) Measure of similarity of documentation based on document structure
CN106844337A (en) A kind of contract lacks clause automatic scanning method and system
CN103310205B (en) A kind of Handwritten Numeral Recognition Method and device
CN107886130A (en) A kind of kNN rapid classification methods based on cluster and Similarity-Weighted
Gattal et al. Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window
CN106874762A (en) Android malicious code detecting method based on API dependence graphs
CN1141665C (en) Micro image characteristic extracting and recognizing method
CN102968622B (en) A kind of TV station symbol recognition method and TV station symbol recognition device
CN104216920A (en) Data classification method based on clustering and Hungary algorithm
Nanni et al. Ensemble to improve gesture recognition
Manimekalai et al. Taxonomic classification of Plant species using support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant