CN104216920B - Data classification method based on cluster and Hungary Algorithm - Google Patents
- Publication number
- CN104216920B (application CN201310220527.0A)
- Authority
- CN
- China
- Prior art keywords
- classification
- sample
- class
- samples
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising: reading an original sample set {X1, X2, ..., XN}; treating all samples in the original sample set {X1, X2, ..., XN} as unlabeled and clustering all of them for a first time with a clustering method, obtaining L+C classes; using the Hungarian algorithm to assign L classes among the L+C classes to the L known classes, so that the classes obtained by the first clustering correspond to the known classes; placing each sample of the labeled subset {X1, X2, ..., Xn} into the class it belongs to, then, keeping the class of each sample in the labeled subset {X1, X2, ..., Xn} fixed, clustering again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise. The method classifies data simply and accurately.
Description
Technical field
The invention belongs to the field of data classification technology, and in particular relates to a data classification method based on clustering and the Hungarian algorithm.
Background art
When analyzing samples, the classes of some samples are often unknown, there are not many samples of known class, and there may be background noise belonging to no class.
For such problems, a classification algorithm cannot produce a reliable classifier: the resulting classifier may be strongly biased, and it cannot separate out the unlabeled classes. A clustering algorithm, on the other hand, ignores the reference value of the labeled samples; moreover, clustering algorithms cannot solve the problem of handling background noise. The closest existing methods are semi-supervised learning algorithms, of which there are currently two main kinds: the first learns from labeled and unlabeled samples; the second learns from positive examples and unlabeled samples. The first requires every class to have at least one labeled sample, which is very restrictive. The second is a two-class algorithm over positive and negative examples: it cannot handle the situation where some classes are labeled and others are not, nor can it handle the presence of background noise.
Summary of the invention
To remedy the defects of the prior art, the present invention provides a data classification method based on clustering and the Hungarian algorithm that classifies data simply and accurately.
The technical solution adopted by the present invention is as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm, comprising the following steps:
S1. Read the original sample set {X1, X2, ..., XN}.
The original sample set {X1, X2, ..., XN} comprises a labeled subset {X1, X2, ..., Xn} and an unlabeled subset {Xn+1, Xn+2, ..., XN}. Each sample in the labeled subset {X1, X2, ..., Xn} has a known class Yi, namely Y1, Y2, ..., Yn; the number of known classes in the labeled subset is L.
The number of unknown classes in the unlabeled subset {Xn+1, Xn+2, ..., XN} is C.
S2. Treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples of the original sample set for a first time with a clustering method, obtaining L+C classes.
S3. Use the Hungarian algorithm to assign L classes among the L+C classes to the L known classes, so that the classes obtained by the first clustering correspond to the known classes.
S4. Place each sample of the labeled subset {X1, X2, ..., Xn} into the class it belongs to; then, keeping the class of each sample in the labeled subset {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise.
Preferably, in S2, the clustering method is KMeans clustering or hierarchical clustering.
Preferably, in S4, the clustering method used when clustering again is KMeans clustering or hierarchical clustering.
Preferably, in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise specifically means: iterate over the unlabeled samples with the objective function, identifying background noise by whether the objective function reaches an extremum; terminate the classification when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
Preferably, the objective function is set to: between-class dispersion * within-class aggregation * discrimination.
Preferably, the between-class dispersion is expressed by the average, mean-square, minimum, or maximum between-class distance.
Preferably, the within-class aggregation is expressed by the average, mean-square, or maximum within-class distance.
Preferably, the discrimination is expressed as: number of classified samples / total number of samples.
The beneficial effects of the present invention are as follows:
The present invention provides a data classification method based on clustering and the Hungarian algorithm suitable for the following situation: the classes of some samples are known, but not many labeled samples of those known classes are required; some samples may belong to unknown classes, i.e. classes with no labeled samples; and there may be background noise, i.e. noise points belonging to no class. The method classifies data simply and accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the data classification method based on clustering and the Hungarian algorithm provided by the invention;
Fig. 2 shows the original sample set of embodiment two;
Fig. 3 shows the expected classification of the samples in embodiment two;
Fig. 4 shows the actual classification of the samples in embodiment two.
Detailed description of the embodiments
The invention is described in detail below with reference to the accompanying drawings:
Embodiment one
S1. Read the original sample set {X1, X2, ..., XN}.
The original sample set {X1, X2, ..., XN} comprises a labeled subset {X1, X2, ..., Xn} and an unlabeled subset {Xn+1, Xn+2, ..., XN}. Each sample in the labeled subset {X1, X2, ..., Xn} has a known class Yi, namely Y1, Y2, ..., Yn; the number of known classes in the labeled subset is L, and the number of unknown classes in the unlabeled subset {Xn+1, Xn+2, ..., XN} is C.
For example, N = 650, L = 2, C = 3, n = 50.
The original sample set consists of 650 malicious-code samples, numbered X1, X2, ..., X650. Among them, the classes of the 50 samples numbered X1, X2, ..., X50 are known: these 50 samples belong to 2 classes, namely 25 hacker viruses and 25 macro viruses. The classes of the 600 samples numbered X51, X52, ..., X650 are unknown; such a sample may be of any type, for example a script virus, a trojan, a worm, a hacker virus, or a macro virus, or it may of course belong to no type at all.
S2. Treat all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and cluster all samples of the original sample set for a first time with a clustering method, obtaining L+C classes.
In this step, any common clustering algorithm may be used, such as KMeans or hierarchical clustering.
For the malicious-code sample set above, clustering the 650 samples yields 5 classes. It must be emphasized that this step only produces 5 classes; it does not name them, i.e. it does not determine which class is the hacker viruses and which is the macro viruses. It merely gathers malicious-code samples of the same kind together.
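As an illustration of this first clustering step only, a minimal KMeans sketch in pure Python; the toy point set, the choice of k, and the convergence test below are our own and not part of the patent:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal KMeans: assign each point a cluster id in 0..k-1."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:             # converged
            break
        centroids = new_centroids
    return labels

# two well-separated toy groups standing in for the L+C clusters of S2
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
labels = kmeans(pts, 2)
```

Hierarchical clustering would serve equally well here; the method only needs the L+C clusters at this point, not their names.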
S3. Use the Hungarian algorithm to assign L classes among the L+C classes to the L known classes, so that the classes obtained by the first clustering correspond to the known classes.
For example, the classes of the 50 malicious-code samples X1, X2, ..., X50 are known and comprise hacker viruses and macro viruses. The assignment therefore deems the one of the 5 clusters containing the most hacker viruses to be the hacker-virus class, and the one containing the most macro viruses to be the macro-virus class.
S4. Place each sample of the labeled subset {X1, X2, ..., Xn} into the class it belongs to; then, keeping the class of each sample in the labeled subset {X1, X2, ..., Xn} fixed, cluster again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise.
For example, suppose that among the 50 samples of known class there are 20 hacker viruses, X1, X2, ..., X20, and that the first clustering gathered the 15 hacker viruses X1, X2, ..., X15 into the hacker-virus class but may have put the 5 hacker viruses X16, X17, ..., X20 into other virus classes. After the assignment, the known hacker viruses X16, X17, ..., X20 must therefore be moved into the class they belong to.
Based on a common clustering algorithm such as KMeans or hierarchical clustering, the re-clustering then iterates over the unlabeled samples with the objective function, identifying background noise by whether the objective function reaches an extremum. The classification terminates when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes.
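The constrained re-clustering of S4 can be sketched as follows. Keeping labeled samples fixed is from the patent; the radius test for background noise is our own simplification standing in for the patent's extremum test on the objective function, and all names and data below are illustrative:

```python
import math

def recluster(points, labels, fixed, k, noise_radius, iters=20):
    """Constrained re-clustering sketch for S4.

    Samples with fixed[i] True keep their class forever (the S4 constraint);
    every other sample joins the nearest class centroid, or becomes background
    noise (label -1) when every centroid is farther than noise_radius.
    """
    labels = list(labels)
    for _ in range(iters):
        # class centroids from the current labeling
        centroids = {}
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new = list(labels)
        for i, p in enumerate(points):
            if fixed[i]:
                continue  # labeled samples never change class
            c, d = min(((c, math.dist(p, cen)) for c, cen in centroids.items()),
                       key=lambda t: t[1])
            new[i] = c if d <= noise_radius else -1
        if new == labels:  # iteration result no longer changes: terminate
            break
        labels = new
    return labels

# two labeled anchor samples, two unlabeled samples, one far-away noise point
pts = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (4.8, 5.1), (20.0, 20.0)]
lab = [0, 1, -1, -1, -1]
fix = [True, True, False, False, False]
result = recluster(pts, lab, fix, k=2, noise_radius=2.0)
```

Here the two unlabeled samples join the classes of their nearby anchors, while the distant point is left as background noise.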
Specifically, the objective function can be set to: between-class dispersion * within-class aggregation * discrimination. These three factors are introduced in turn below:
(1) Between-class dispersion
Depending on the concrete application, the between-class dispersion may be expressed by the average, mean-square, minimum, or maximum between-class distance.
It can, for example, be set to:
(average distance between samples of different classes) / (average distance between all classified samples)
= [(total distance between all classified samples) − (total distance between samples within the same class)] / (total distance between all classified samples) * m(m − 1) / [m(m − 1) − Σc mc(mc − 1)],
where m is the number of classified samples and mc is the number of samples in class c.
For instance, the quantities above are defined as follows. Suppose there are five classes in total: hacker virus, macro virus, script virus, trojan, and worm. Let Xi and Xj denote samples and dij the distance between Xi and Xj; let Yi denote the number of the class to which Xi is assigned, and Yj likewise for Xj. Yi = 0 means Xi is assigned to no class; when Xi is assigned to some class, Yi takes a value in 1 to (L + C). Then:
the average distance between all classified samples is the average of dij over all i, j with Yi > 0 and Yj > 0;
the total distance between all classified samples is the sum of dij over all i, j with Yi > 0 and Yj > 0;
the number of classified samples is the number of all i with Yi > 0;
the total distance between samples within the same class is the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
the average distance between samples of different classes is the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj.
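The ratio definition and the closed form for the between-class dispersion can be checked numerically. In the sketch below (the symbols m and m_c and the toy data are ours), the sums run over ordered pairs i ≠ j, matching the m(m − 1) pair counts:

```python
import math

def between_class_dispersion(points, y):
    """Between-class dispersion: (average distance between samples of
    different classes) / (average distance between all classified samples).
    y[i] = 0 means 'assigned to no class'; classified samples have y[i] > 0.
    Returns the directly computed ratio and the closed form, which must agree.
    """
    idx = [i for i, c in enumerate(y) if c > 0]
    total = within = 0.0
    pairs = within_pairs = 0
    for i in idx:
        for j in idx:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            total += d
            pairs += 1
            if y[i] == y[j]:
                within += d
                within_pairs += 1
    direct = ((total - within) / (pairs - within_pairs)) / (total / pairs)
    # closed form: ratio of totals times the pair-count correction
    # m(m - 1) / (m(m - 1) - sum_c m_c(m_c - 1))
    m = len(idx)
    counts = {}
    for i in idx:
        counts[y[i]] = counts.get(y[i], 0) + 1
    closed = ((total - within) / total) * m * (m - 1) / (
        m * (m - 1) - sum(mc * (mc - 1) for mc in counts.values()))
    return direct, closed

# two compact classes far apart, plus one unclassified sample (y = 0)
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
direct, closed = between_class_dispersion(pts, [1, 1, 2, 2, 0])
```

For well-separated classes the ratio exceeds 1, since between-class distances dominate the overall average.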
(2) Within-class aggregation
The within-class aggregation may use a within-class similarity, or a negative-exponential form of distance; it may also be expressed by the average, mean-square, or maximum within-class distance.
(3) Discrimination
The simplest form of the discrimination is: number of classified samples / total number of samples, or some function of this ratio.
For example, for the malicious-code sample set above the total number of samples is 650. If all 650 malicious-code samples receive a class label, the discrimination is 100%; if 10 malicious-code samples are deemed to belong to no class, the discrimination is (650 − 10)/650.
Properties: the between-class dispersion is the bigger the better, the within-class aggregation is the bigger the better, and the discrimination is the higher the better.
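Combining the three factors gives one possible objective function, sketched below. The concrete choices (average distances, a negative-exponential aggregation exp(−d), the plain ratio for discrimination) are ours; the patent permits several variants of each factor:

```python
import math

def objective(points, y):
    """Objective function sketch: between-class dispersion * within-class
    aggregation * discrimination. y[i] = 0 means 'assigned to no class'.
    """
    idx = [i for i, c in enumerate(y) if c > 0]
    between = within = 0.0
    nb = nw = 0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            d = math.dist(points[i], points[j])
            if y[i] == y[j]:
                within += d
                nw += 1
            else:
                between += d
                nb += 1
    dispersion = between / nb if nb else 0.0             # avg between-class distance
    aggregation = math.exp(-within / nw) if nw else 1.0  # negative-exponential form
    discrimination = len(idx) / len(points)              # classified / total
    return dispersion * aggregation * discrimination

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
good = objective(pts, [1, 1, 2, 2, 0])  # compact classes, far apart
bad = objective(pts, [1, 2, 1, 2, 0])   # the same samples with classes mixed
```

A labeling that keeps classes compact and far apart scores higher, which is exactly what the iteration in S4 seeks.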
Embodiment two
A concrete application of the data classification method based on clustering and the Hungarian algorithm provided in embodiment one of the present invention is described below.
As shown in Fig. 2, the original sample set comprises 1000 samples plus 100 noise points. The set roughly divides into 4 classes, whose overall shape resembles the Chinese character "六" ("six"). The left-falling and right-falling strokes below the character each carry 10 labeled samples, namely the samples pointed to by A and B in Fig. 2; the samples pointed to by A represent one class and those pointed to by B another. The "dot" and "horizontal" strokes above are unlabeled samples.
As shown in Fig. 3, the desired classification result is that the 4 classes be separated out, i.e. the samples enclosed by the circles and ellipses in Fig. 3, while the samples scattered outside the circles and ellipses are left unclassified as background noise.
Applying the data classification method based on clustering and the Hungarian algorithm provided by the invention yields the classification result shown in Fig. 4. As can be seen from Fig. 4, the invention essentially identifies the four stroke parts of "六" and filters out the background-noise samples.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (1)
1. A data classification method based on clustering and the Hungarian algorithm, characterized by comprising the following steps:
S1. reading an original sample set {X1, X2, ..., XN};
the original sample set {X1, X2, ..., XN} comprising a labeled subset {X1, X2, ..., Xn} and an unlabeled subset {Xn+1, Xn+2, ..., XN}; each sample in the labeled subset {X1, X2, ..., Xn} having a known class Yi, namely Y1, Y2, ..., Yn; the number of known classes in the labeled subset being L;
the number of unknown classes in the unlabeled subset {Xn+1, Xn+2, ..., XN} being C;
S2. treating all samples in the original sample set {X1, X2, ..., XN} as unlabeled, and clustering all samples of the original sample set for a first time with a clustering method, obtaining L+C classes;
S3. using the Hungarian algorithm to assign L classes among the L+C classes to the L known classes, so that the classes obtained by the first clustering correspond to the known classes;
S4. placing each sample of the labeled subset {X1, X2, ..., Xn} into the class it belongs to; then, keeping the class of each sample in the labeled subset {X1, X2, ..., Xn} fixed, clustering again, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise;
in S2, the clustering method being KMeans clustering or hierarchical clustering;
in S4, the clustering method used when clustering again being KMeans clustering or hierarchical clustering;
in S4, iterating over the unlabeled samples with an objective function so that each unlabeled sample is either assigned to some class or treated as background noise specifically meaning: iterating over the unlabeled samples with the objective function, identifying background noise by whether the objective function reaches an extremum; terminating the classification when the current iteration result no longer differs from the previous iteration result, or when the objective function no longer changes;
the objective function being set to: between-class dispersion * within-class aggregation * discrimination;
the between-class dispersion being expressed by the average, mean-square, minimum, or maximum between-class distance, and being settable to:
(average distance between samples of different classes) / (average distance between all classified samples) = [(total distance between all classified samples) − (total distance between samples within the same class)] / (total distance between all classified samples) * m(m − 1) / [m(m − 1) − Σc mc(mc − 1)],
where m is the number of classified samples and mc is the number of samples in class c; and where, with Xi, Xj denoting samples, dij the distance between Xi and Xj, and Yi the number of the class to which Xi is assigned (Yi = 0 meaning Xi is assigned to no class):
the average distance between all classified samples is the average of dij over all i, j with Yi > 0 and Yj > 0;
the total distance between all classified samples is the sum of dij over all i, j with Yi > 0 and Yj > 0;
the number of classified samples is the number of all i with Yi > 0;
the total distance between samples within the same class is the sum of dij over all i, j with Yi > 0, Yj > 0 and Yi = Yj;
the average distance between samples of different classes is the average of dij over all i, j with Yi > 0, Yj > 0 and Yi ≠ Yj;
the within-class aggregation being expressed by the average, mean-square, or maximum within-class distance;
the discrimination being expressed as: number of classified samples / total number of samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310220527.0A CN104216920B (en) | 2013-06-05 | 2013-06-05 | Data classification method based on cluster and Hungary Algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104216920A CN104216920A (en) | 2014-12-17 |
CN104216920B true CN104216920B (en) | 2017-11-21 |
Family
ID=52098417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310220527.0A Active CN104216920B (en) | 2013-06-05 | 2013-06-05 | Data classification method based on cluster and Hungary Algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104216920B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101216884A (en) * | 2007-12-29 | 2008-07-09 | 北京中星微电子有限公司 | A method and system for face authentication |
CN101350011A (en) * | 2007-07-18 | 2009-01-21 | 中国科学院自动化研究所 | Method for detecting search engine cheat based on small sample set |
CN102651088A (en) * | 2012-04-09 | 2012-08-29 | 南京邮电大学 | Classification method for malicious code based on A_Kohonen neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8452770B2 (en) * | 2010-07-15 | 2013-05-28 | Xerox Corporation | Constrained nonnegative tensor factorization for clustering |
US8832655B2 (en) * | 2011-09-29 | 2014-09-09 | Accenture Global Services Limited | Systems and methods for finding project-related information by clustering applications into related concept categories |
- 2013-06-05: application CN201310220527.0A filed; granted as CN104216920B (status: Active)
Non-Patent Citations (1)
Title |
---|
Research and Application of Intelligent Dimensionality-Reduction Technology; An Yajing; China Master's Theses Full-text Database, Information Science and Technology; 2012-07-15 (No. 07); abstract page and pp. 5-12 *
Also Published As
Publication number | Publication date |
---|---|
CN104216920A (en) | 2014-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2018142757A (en) | SYSTEM AND METHOD FOR DETECTING PLANT DISEASES | |
Carmichael et al. | Shape-based recognition of wiry objects | |
Yang et al. | Rapid detection of rice disease using microscopy image identification based on the synergistic judgment of texture and shape features and decision tree–confusion matrix method | |
CN109409400A (en) | Merge density peaks clustering method, image segmentation system based on k nearest neighbor and multiclass | |
Ross et al. | Exploiting the “doddington zoo” effect in biometric fusion | |
CN107066951B (en) | Face spontaneous expression recognition method and system | |
CN107145778B (en) | Intrusion detection method and device | |
CN101923652A (en) | Pornographic picture identification method based on joint detection of skin colors and featured body parts | |
JP2017224278A (en) | Method of letting rejector learn by constituting classification tree utilizing training image and detecting object on test image utilizing rejector | |
CN106506528A (en) | A kind of Network Safety Analysis system under big data environment | |
CN102129574B (en) | A kind of face authentication method and system | |
CN104091178A (en) | Method for training human body sensing classifier based on HOG features | |
Babu et al. | Handwritten digit recognition using structural, statistical features and k-nearest neighbor classifier | |
CN110009005A (en) | A kind of net flow assorted method based on feature strong correlation | |
Rafea et al. | Classification of a COVID-19 dataset by using labels created from clustering algorithms | |
Mörzinger et al. | Visual Structure Analysis of Flow Charts in Patent Images. | |
Le et al. | Document retrieval based on logo spotting using key-point matching | |
CN104216920B (en) | Data classification method based on cluster and Hungary Algorithm | |
Gattal et al. | Segmentation and recognition strategy of handwritten connected digits based on the oriented sliding window | |
CN110032973A (en) | A kind of unsupervised helminth classification method and system based on artificial intelligence | |
Xu et al. | Scene text detection based on robust stroke width transform and deep belief network | |
US9811726B2 (en) | Chinese, Japanese, or Korean language detection | |
CN101840510B (en) | Adaptive enhancement face authentication method based on cost sensitivity | |
JP7341962B2 (en) | Learning data collection device, learning device, learning data collection method and program | |
Ray | Extracting region of interest for palm print authentication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |