CN104216920B - Data classification method based on cluster and Hungary Algorithm - Google Patents

Data classification method based on cluster and Hungary Algorithm

Info

Publication number
CN104216920B
CN104216920B
Authority
CN
China
Prior art keywords
classification
sample
class
samples
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310220527.0A
Other languages
Chinese (zh)
Other versions
CN104216920A (en
Inventor
Hu Yong (胡勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Cheerbright Technologies Co Ltd
Original Assignee
Beijing Cheerbright Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Cheerbright Technologies Co Ltd filed Critical Beijing Cheerbright Technologies Co Ltd
Priority to CN201310220527.0A priority Critical patent/CN104216920B/en
Publication of CN104216920A publication Critical patent/CN104216920A/en
Application granted granted Critical
Publication of CN104216920B publication Critical patent/CN104216920B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising: reading an original sample set {X1, X2, ..., XN}; treating all samples in {X1, X2, ..., XN} as unlabeled, and clustering all samples in the original sample set once with a clustering method to obtain L+C classes; assigning the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes; placing each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, clustering again and iterating over the unlabeled samples with an objective function, so that each unlabeled sample is assigned to some class or treated as background noise. The method classifies data simply and accurately, and the classification results are accurate.

Description

Data classification method based on clustering and the Hungarian algorithm
Technical field
The invention belongs to the field of data classification technology, and in particular relates to a data classification method based on clustering and the Hungarian algorithm.
Background technology
When analyzing samples, the classes of some samples are often unknown, the samples of known classes may be few, and there may be background noise that belongs to no class.
For such problems, a pure classification algorithm cannot produce a reliable classifier: the resulting classifier may be badly biased, and it cannot separate out the unlabeled classes. A pure clustering algorithm, on the other hand, ignores the reference value of the labeled samples; moreover, clustering algorithms cannot handle the treatment of background noise either. Closer approaches are semi-supervised learning algorithms, of which there are currently two main kinds: the first learns from both labeled and unlabeled samples; the second learns from positive examples and unlabeled samples. The first requires every class to have labeled samples, which is very restrictive. The second is a binary classification algorithm over positive and negative examples, so it can handle neither the case where some classes are labeled and others are not, nor the presence of background noise.
Summary of the invention
To address the defects of the prior art, the present invention provides a data classification method based on clustering and the Hungarian algorithm that can classify data simply and accurately, with accurate classification results.
The technical solution adopted by the present invention is as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising the following steps:
S1: read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
The number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C;
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes;
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise.
Preferably, in S2, the clustering method is KMeans clustering or hierarchical clustering.
Preferably, in S4, the clustering method used when clustering again is KMeans clustering or hierarchical clustering.
Preferably, in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise is specifically:
iterating over the unlabeled samples with the objective function, and identifying background noise by whether the objective function reaches an extremum; classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
Preferably, the objective function is set as: inter-class dispersion * intra-class aggregation * discrimination rate.
Preferably, the inter-class dispersion is expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance.
Preferably, the intra-class aggregation is expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance.
Preferably, the expression for the discrimination rate is: number of classified samples / total number of samples.
The beneficial effects of the present invention are as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm that is suitable for the following situation: the classes of some samples are known; the labeled samples of the known classes need not be many; some samples may belong to unknown classes, i.e. classes with no labeled samples; and there may be background noise, i.e. noise points belonging to no class. The method classifies data simply and accurately, and the classification results are accurate.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the data classification method based on clustering and the Hungarian algorithm provided by the invention;
Fig. 2 shows the original sample set in embodiment two;
Fig. 3 shows the desired classification of the samples in embodiment two;
Fig. 4 shows the actual classification of the samples in embodiment two.
Detailed description of the embodiments
The present invention is described in detail below in conjunction with the accompanying drawings:
Embodiment one
S1: read the original sample set {X1, X2, ..., XN};
The original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
The number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C.
For example, N=650, L=2, C=3, n=50;
The original sample set consists of 650 malicious-code samples, numbered X1, X2, ..., X650. Among these 650 malicious-code samples, the classes of the 50 samples numbered X1, X2, ..., X50 are known: these 50 samples belong to 2 classes, namely 25 hacker viruses and 25 macro viruses. The classes of the 600 samples numbered X51, X52, ..., X650 are unknown; an unknown malicious-code sample may be of any type, for example a script virus, trojan, worm, hacker virus, or macro virus, or of course it may belong to no type at all.
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
In this step, the clustering method can be a common clustering algorithm, such as KMeans or hierarchical clustering.
For the malicious-code sample set above, clustering the 650 samples yields 5 classes. It must be emphasized that this step produces 5 classes but does not distinguish class names; that is, it does not determine which class is the hacker viruses and which class is the macro viruses. It simply gathers malicious-code samples of the same kind together.
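As an illustration of the first clustering in S2 (not part of the patent text), a minimal KMeans-style pass could be sketched in Python as follows; the sample points, the fixed initial centers, and the `kmeans` helper are all hypothetical:

```python
# Minimal KMeans sketch for step S2: cluster all samples with no labels.
# The 2-D data points and the initial centers below are hypothetical.

def kmeans(points, centers, iters=20):
    """Plain Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(d.index(min(d)))
    return labels, centers

# Two obvious groups of samples; here L + C = 2 clusters.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(samples, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels)  # first three samples share one cluster, last three the other
```

In practice any common clustering implementation (library KMeans or hierarchical clustering) would be used in place of this toy loop, as the text notes.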
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes.
For example, since the classes of the 50 malicious-code samples X1, X2, ..., X50 are known to include hacker viruses and macro viruses, the assignment takes the class among the 5 that contains the most hacker viruses to be the hacker-virus class, and the class among the 5 that contains the most macro viruses to be the macro-virus class.
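A sketch of the S3 matching under hypothetical overlap counts. The patent specifies the Hungarian algorithm; for small L, the brute-force enumeration below finds the same optimal one-to-one assignment that the Hungarian algorithm computes in polynomial time (a production implementation would use a solver such as `scipy.optimize.linear_sum_assignment`):

```python
# Sketch of step S3: match the L known classes to L of the L + C clusters
# so that the total number of agreeing labeled samples is maximized.
# For small L, brute force over permutations yields the same optimal
# assignment the Hungarian algorithm finds in polynomial time.
from itertools import permutations

def assign_classes(overlap):
    """overlap[k][c] = number of labeled samples of known class k that the
    first clustering placed in cluster c (hypothetical counts)."""
    n_classes, n_clusters = len(overlap), len(overlap[0])
    best_score, best = -1, None
    for perm in permutations(range(n_clusters), n_classes):
        score = sum(overlap[k][perm[k]] for k in range(n_classes))
        if score > best_score:
            best_score, best = score, perm
    return dict(enumerate(best))  # known class index -> cluster index

# Hypothetical counts for L = 2 known classes over L + C = 5 clusters:
# row 0: hacker viruses, row 1: macro viruses.
overlap = [
    [18, 2, 3, 1, 1],  # most hacker viruses fell into cluster 0
    [1, 20, 2, 1, 1],  # most macro viruses fell into cluster 1
]
print(assign_classes(overlap))  # {0: 0, 1: 1}
```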
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with the objective function so that each unlabeled sample is assigned to some class or treated as background noise.
For example, suppose that among the 50 malicious-code samples of known class there are 20 hacker viruses, X1, X2, ..., X20, and that in the first clustering the 15 hacker viruses X1, X2, ..., X15 were gathered into the hacker-virus class while the 5 hacker viruses X16, X17, ..., X20 may have been gathered into other virus classes. The known hacker viruses X16, X17, ..., X20 must therefore be placed into their home class after the assignment.
After re-clustering with a common clustering algorithm such as KMeans or hierarchical clustering, the unlabeled samples are iterated with the objective function, and background noise is identified by whether the objective function reaches an extremum. Classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
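The shape of the S4 loop can be sketched as follows. This is an illustration, not the patent's exact procedure: the distance threshold used here to declare a sample background noise (label 0) is a hypothetical stand-in for the objective-function extremum test described in the text, and the sample coordinates are invented. What the sketch does preserve is the constraint that labeled samples never move, and the termination rule that the loop stops when the current assignment equals the previous one:

```python
# Sketch of the S4 iteration: labeled samples keep their class; unlabeled
# samples are reassigned each round; the loop stops when the assignment
# no longer changes. noise_dist is a hypothetical stand-in for the
# objective-function test that identifies background noise (label 0).

def iterate_unlabeled(samples, labels, noise_dist=2.0, max_rounds=50):
    """labels[i] > 0 is a fixed known class; labels[i] is None when
    sample i is unlabeled. Returns final labels, 0 = background noise."""
    fixed = [l is not None for l in labels]
    labels = [l if l is not None else 0 for l in labels]
    for _ in range(max_rounds):
        # Class centers from the current assignment (noise excluded).
        centers = {}
        for cls in set(l for l in labels if l > 0):
            members = [s for s, l in zip(samples, labels) if l == cls]
            centers[cls] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = list(labels)
        for i, s in enumerate(samples):
            if fixed[i]:
                continue
            dists = {c: sum((a - b) ** 2 for a, b in zip(s, ctr)) ** 0.5
                     for c, ctr in centers.items()}
            cls, d = min(dists.items(), key=lambda kv: kv[1])
            new_labels[i] = cls if d <= noise_dist else 0
        if new_labels == labels:  # current result equals last result: stop
            break
        labels = new_labels
    return labels

samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.8, 5.1), (20.0, 20.0)]
labels = [1, None, 2, None, None]  # two known samples, three unlabeled
print(iterate_unlabeled(samples, labels))  # [1, 1, 2, 2, 0]
```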
Specifically, the objective function can be set as: inter-class dispersion * intra-class aggregation * discrimination rate. These three parameters are introduced in turn below:
(1) Inter-class dispersion
Depending on the concrete application, the inter-class dispersion may be expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance.
It can be set as: (average distance between samples of different classes) / (average distance between all classified samples) = ((total distance between all classified samples - total distance between samples within each class) / (total distance between all classified samples)) * (number of classified samples * (number of classified samples - 1)) / Σ (number of samples of a class * (number of samples of that class - 1)).
For example, to define the average distance between samples of different classes, suppose there are five classes: hacker virus, macro virus, script virus, trojan, and worm. Let Xi and Xj denote samples and dij the distance between Xi and Xj. Let Yi denote the number of the class to which Xi is assigned, and Yj the number of the class to which Xj is assigned; Yi = 0 means that Xi is assigned to no class, and when Xi is assigned to some class, Yi takes a value in 1 to (L+C). Then:
The average distance between all classified samples is: the average of dij over all i, j with Yi > 0 and Yj > 0;
The total distance between all classified samples is: the sum of dij over all i, j with Yi > 0 and Yj > 0;
The number of classified samples is: the number of all i with Yi > 0;
The total distance between samples within each class is: the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
The average distance between samples of different classes is: the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj.
(2) Intra-class aggregation
The intra-class aggregation can be expressed by the intra-class similarity, or by a negative-exponent form of distance; it can also be expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance.
(3) Discrimination rate
The simplest form of the discrimination rate is: number of classified samples / total number of samples, or a function of this ratio.
For example, for the malicious-code sample set above, the total number of samples is 650. If all 650 malicious-code samples carry class labels, the discrimination rate is 100%; if 10 malicious-code samples are considered to belong to no class, the discrimination rate is (650-10)/650.
Properties: the larger the inter-class dispersion, the better; the larger the intra-class aggregation, the better; and the higher the discrimination rate, the better.
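The three factors can be combined in code. The sketch below makes one hedged choice per factor: dispersion as (average distance between samples of different classes) / (average distance between all classified samples), aggregation as the negative-exponent form exp(-average intra-class distance) mentioned in the text, and the discrimination rate as classified samples / total samples; the sample coordinates are invented for illustration:

```python
# Illustrative objective function:
#   inter-class dispersion * intra-class aggregation * discrimination rate
# Dispersion: avg different-class distance / avg distance over classified
# samples. Aggregation: exp(-avg intra-class distance) (negative-exponent
# form). Discrimination: classified / total. Y[i] = 0 means sample i is
# unclassified (background noise).
from math import exp, dist

def objective(samples, Y):
    pairs = [(i, j) for i in range(len(samples)) for j in range(len(samples))
             if i < j and Y[i] > 0 and Y[j] > 0]
    cross = [dist(samples[i], samples[j]) for i, j in pairs if Y[i] != Y[j]]
    within = [dist(samples[i], samples[j]) for i, j in pairs if Y[i] == Y[j]]
    all_d = cross + within
    dispersion = (sum(cross) / len(cross)) / (sum(all_d) / len(all_d))
    aggregation = exp(-sum(within) / len(within))
    discrimination = sum(1 for y in Y if y > 0) / len(Y)
    return dispersion * aggregation * discrimination

samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
tight = objective(samples, [1, 1, 2, 2, 0])  # compact, well-separated classes
loose = objective(samples, [1, 2, 1, 2, 0])  # classes mixed together
print(tight > loose)  # True: the tighter partition scores higher
```

This matches the stated properties: a partition with larger inter-class dispersion, larger intra-class aggregation, and a higher discrimination rate yields a larger objective value.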
Embodiment two
A concrete application of the data classification method based on clustering and the Hungarian algorithm provided by embodiment one of the present invention is described below.
Fig. 2 shows the original sample set, comprising 1000 samples and 100 noise points in total. The original sample set roughly divides into 4 classes, whose shape resembles the Chinese character "六" (six). The "left-falling stroke" and "right-falling stroke" below the character each have 10 labeled samples, shown as the samples pointed to by A and B in Fig. 2; the samples pointed to by A represent one class, and the samples pointed to by B represent another class. The "dot" and "horizontal stroke" above are unlabeled samples.
Fig. 3 shows the desired result of the cluster classification: 4 classes should be separated out, i.e. the samples enclosed by the circles and ellipses in Fig. 3, while the remaining samples scattered outside the circles and ellipses are treated as background-noise samples and are not classified.
Applying the data classification method based on clustering and the Hungarian algorithm provided by the invention yields the classification result shown in Fig. 4. As can be seen from Fig. 4, the invention essentially identifies the four stroke parts of "六" and filters out the background-noise samples.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention.

Claims (1)

1. A data classification method based on clustering and the Hungarian algorithm, characterized by comprising the following steps:
S1: read the original sample set {X1, X2, ..., XN};
the original sample set {X1, X2, ..., XN} comprises a known-class sample subset {X1, X2, ..., Xn} and an unknown-class sample subset {Xn+1, Xn+2, ..., XN}; the class Yi to which each sample in the known-class subset {X1, X2, ..., Xn} belongs is Y1, Y2, ..., Yn respectively; the number of known classes in the known-class subset is L;
the number of unknown classes in the unknown-class subset {Xn+1, Xn+2, ..., XN} is C;
S2: treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples in the original sample set once with a clustering method, obtaining L+C classes;
S3: assign the L known classes to L of the L+C classes by the Hungarian algorithm, so that the classes obtained by the first clustering correspond to the known classes;
S4: place each sample of the known-class subset {X1, X2, ..., Xn} into its home class, then, keeping the class of each sample in {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise;
in S2, the clustering method is KMeans clustering or hierarchical clustering;
in S4, the clustering method used when clustering again is KMeans clustering or hierarchical clustering;
in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is assigned to some class or treated as background noise is specifically:
iterating over the unlabeled samples with the objective function, and identifying background noise by whether the objective function reaches an extremum; classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes;
the objective function is set as: inter-class dispersion * intra-class aggregation * discrimination rate;
the inter-class dispersion is expressed by the inter-class average distance, the inter-class mean-square distance, the inter-class minimum distance, or the inter-class maximum distance;
it can be set as: (average distance between samples of different classes) / (average distance between all classified samples) = ((total distance between all classified samples - total distance between samples within each class) / (total distance between all classified samples)) * (number of classified samples * (number of classified samples - 1)) / Σ (number of samples of a class * (number of samples of that class - 1));
the average distance between all classified samples is: the average of dij over all i, j with Yi > 0 and Yj > 0;
the total distance between all classified samples is: the sum of dij over all i, j with Yi > 0 and Yj > 0;
the number of classified samples is: the number of all i with Yi > 0;
the total distance between samples within each class is: the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
the average distance between samples of different classes is: the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj;
the intra-class aggregation is expressed by the intra-class average distance, the intra-class mean-square distance, or the intra-class maximum distance;
the expression for the discrimination rate is: number of classified samples / total number of samples.
CN201310220527.0A 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm Active CN104216920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310220527.0A CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Publications (2)

Publication Number Publication Date
CN104216920A CN104216920A (en) 2014-12-17
CN104216920B true CN104216920B (en) 2017-11-21

Family

ID=52098417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310220527.0A Active CN104216920B (en) 2013-06-05 2013-06-05 Data classification method based on cluster and Hungary Algorithm

Country Status (1)

Country Link
CN (1) CN104216920B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216884A (en) * 2007-12-29 2008-07-09 北京中星微电子有限公司 A method and system for face authentication
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN102651088A (en) * 2012-04-09 2012-08-29 南京邮电大学 Classification method for malicious code based on A_Kohonen neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452770B2 (en) * 2010-07-15 2013-05-28 Xerox Corporation Constrained nonnegative tensor factorization for clustering
US8832655B2 (en) * 2011-09-29 2014-09-09 Accenture Global Services Limited Systems and methods for finding project-related information by clustering applications into related concept categories

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An Yajing, "Research and Application of Intelligent Dimensionality Reduction Technology", China Master's Theses Full-text Database, Information Science and Technology, No. 07, 2012-07-15, abstract page and pp. 5-12 *

Also Published As

Publication number Publication date
CN104216920A (en) 2014-12-17

Similar Documents

Publication Publication Date Title
RU2018142757A (en) SYSTEM AND METHOD FOR DETECTING PLANT DISEASES
Carmichael et al. Shape-based recognition of wiry objects
Yang et al. Rapid detection of rice disease using microscopy image identification based on the synergistic judgment of texture and shape features and decision tree–confusion matrix method
CN109409400A (en) Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass
Ross et al. Exploiting the “doddington zoo” effect in biometric fusion
CN107066951B (en) Face spontaneous expression recognition method and system
CN107145778B (en) Intrusion detection method and device
CN101923652A (en) Pornographic picture identification method based on joint detection of skin colors and featured body parts
JP2017224278A (en) Method of letting rejector learn by constituting classification tree utilizing training image and detecting object on test image utilizing rejector
CN106506528A (en) A kind of Network Safety Analysis system under big data environment
CN102129574B (en) A kind of face authentication method and system
CN104091178A (en) Method for training human body sensing classifier based on HOG features
Babu et al. Handwritten digit recognition using structural, statistical features and k-nearest neighbor classifier
CN110009005A (en) A kind of net flow assorted method based on feature strong correlation
Rafea et al. Classification of a COVID-19 dataset by using labels created from clustering algorithms
Mörzinger et al. Visual Structure Analysis of Flow Charts in Patent Images.
Le et al. Document retrieval based on logo spotting using key-point matching
CN104216920B (en) Data classification method based on cluster and Hungary Algorithm
Gattal et al. Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window
CN110032973A (en) A kind of unsupervised helminth classification method and system based on artificial intelligence
Xu et al. Scene text detection based on robust stroke width transform and deep belief network
US9811726B2 (en) Chinese, Japanese, or Korean language detection
CN101840510B (en) Adaptive enhancement face authentication method based on cost sensitivity
JP7341962B2 (en) Learning data collection device, learning device, learning data collection method and program
Ray Extracting region of interest for palm print authentication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant