CN105320968A - Improved method for centroid classifier - Google Patents
Improved method for centroid classifier
- Publication number
- CN105320968A (application CN201510801697.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- classification
- parameter
- barycenter
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses an improved method for a centroid classifier. The method comprises the following steps: a. selecting a training set and a test set, and constructing the centroids; b. computing the similarity between the data to be classified and the centroid vectors; c. initializing a parameter k for each class, where adjusting k adjusts the position of the classification boundary; d. training the parameter k; and e. applying the trained parameter k to the data to be classified. The improved method for the centroid classifier disclosed by the present invention raises the classification accuracy of the original CBC (centroid-based classifier) on imbalanced classes and can be used for data classification.
Description
Technical field
The present invention relates to the field of computer science and technology, in particular to data classification methods. It can be applied to the classification of Internet data and can improve the efficiency and accuracy of information retrieval.
Background technology
With the development of information technology, the amount of information available to people has grown explosively. Faced with this ever-increasing mass of information, obtaining the required data quickly and effectively by manual means alone has become more and more difficult. Automated tools are needed to help people manage and filter this information, and the CBC (centroid-based classifier) is one of the better data classifiers.
Centroid-based classification is one of the classic classification techniques. Its idea is simple: a text is assigned to a class according to the similarity between its feature vector and the class centroid. Centroid-based classification is easy to understand and implement, performs stably, often outperforming Naive Bayes, KNN, and the C4.5 decision tree, and its algorithmic complexity is linear in the size of the text collection, so classification is efficient and overfitting is unlikely.
For example, in October 2009 Chai Yumei, Zhu Guochong, Zan Hongying, and other authors published a paper entitled "A text classification algorithm based on centroids" in the journal Computer Engineering. The gist of the paper is that when the text collection is widely dispersed or multi-modal, the classification performance of the centroid-based algorithm is poor. For this problem it proposes an improved centroid-based text classification algorithm whose performance is higher than that of the classical centroid-based algorithm. Test results of the UCK algorithm on a text classification corpus provided by the Hong Kong Hui Kexun company show that the efficiency and precision of the algorithm meet requirements.
However, the deficiency of the traditional centroid-based classification method represented by the paper above is also apparent: the centroid vector is computed from all the feature vectors of the labeled texts belonging to a class, so when the texts of one class are widely dispersed, or when classes overlap, its classification performance is poor.
Summary of the invention
The present invention addresses the above defects and deficiencies of the prior art and provides an improved method for a centroid classifier. The method adds a parameter to each class; without reducing classification efficiency, it significantly improves classification performance and solves the problem that the original centroid classifier has low classification accuracy on imbalanced classes.
The present invention is realized by adopting the following technical scheme:
An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid. The AAC centroid is the arithmetic mean of all data in class C_j; the centroid vector C_j is computed as

C_j = (1/S) * Σ_{x_i ∈ C_j} x_i

where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
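The AAC centroid construction of step a can be sketched in Python as follows. This is an illustration, not part of the patent; the function name `build_centroids` and the toy term-weight vectors are assumptions.

```python
import numpy as np

def build_centroids(X, y):
    """Arithmetical average centroid (AAC): the centroid of each class
    is the arithmetic mean of the sample vectors belonging to it,
    i.e. C_j = (1/S) * sum of x_i over x_i in C_j."""
    centroids = {}
    for label in np.unique(y):
        members = X[y == label]                   # all samples of class C_j
        centroids[label] = members.mean(axis=0)   # arithmetic mean -> centroid
    return centroids

# toy example: four documents in a 3-dimensional term-weight space
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([0, 0, 1, 1])
cents = build_centroids(X, y)
print(cents[0])  # mean of the first two rows: [0.5 0.5 0. ]
```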
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure:

Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
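The cosine measure of step b, as a minimal self-contained sketch (the function name and example vectors are illustrative):

```python
import math

def cosine_sim(x, c):
    """Cosine similarity Sim(x, C_j) = (x . C_j) / (||x|| * ||C_j||)."""
    dot = sum(a * b for a, b in zip(x, c))
    nx = math.sqrt(sum(a * a for a in x))
    nc = math.sqrt(sum(b * b for b in c))
    return dot / (nx * nc)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```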
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
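The training loop of step d can be sketched as follows. This is a hedged reconstruction: the step size `delta`, the stopping tolerance `tol`, the epoch cap, and the toy data are all assumptions chosen for illustration, since the patent leaves the fixed increment/decrement and the accuracy range to be set according to the actual situation.

```python
import math

def cosine(x, c):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, c))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c)))

def train_k(train, centroids, sim, delta=0.2, max_epochs=50, tol=1e-3):
    """Train one multiplier k_j per class. On each misclassification of a
    sample of true class i into class j, k_i is increased and k_j decreased
    by the fixed step delta; training stops once the training accuracy
    fluctuates by less than tol between epochs."""
    k = {c: 1.0 for c in centroids}              # k_1 = ... = k_p = 1
    prev_acc = None
    for _ in range(max_epochs):
        correct = 0
        for x, true_c in train:
            pred = max(centroids, key=lambda j: k[j] * sim(x, centroids[j]))
            if pred == true_c:
                correct += 1
            else:                                # misclassified: adjust both classes
                k[true_c] += delta
                k[pred] -= delta
        acc = correct / len(train)
        if prev_acc is not None and abs(acc - prev_acc) < tol:
            break                                # accuracy has stabilized
        prev_acc = acc
    return k

# toy data: the third sample belongs to class 0 but is initially
# pulled toward the class-1 centroid, so k must shift the boundary
centroids = {0: (1.0, 0.0), 1: (0.0, 1.0)}
train = [((1.0, 0.0), 0), ((0.0, 1.0), 1), ((0.6, 0.8), 0)]
k = train_k(train, centroids, cosine)
print(k)  # k[0] has grown above 1, k[1] has shrunk below 1
```

After one adjustment, the misclassified sample satisfies k_0 * Sim(x, C_0) > k_1 * Sim(x, C_1) and training converges with the centroids themselves unchanged, matching the invention's claim that only the boundary moves.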
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Compared with the prior art, the beneficial effects achieved by the present invention are as follows:
Through steps a-e, the present invention retains the linear character of the CBC classification method: complexity is low and classification is fast. It improves the accuracy of the CBC classification method. The centroids of the present invention remain unchanged; the classification boundary is adjusted through the parameter k, so centroid drift does not occur.
Description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments, wherein:
Fig. 1 illustrates the classification principle of the original CBC;
Fig. 2 illustrates the classification principle of the improved CBC of the present invention.
Detailed description of the embodiments
Embodiment 1
The main purpose of the present invention is to improve the centroid-based classification (CBC) method, make up for the deficiencies of the original centroid-based method, and obtain better classification results.
The original CBC classification process:
(1) Construct the centroids. Using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), and construct, by methods such as AAC, a vector that represents a class of data, i.e. the centroid. The arithmetical average centroid (AAC) is the arithmetic mean of all data in class C_j; the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
(2) Compute the similarity between the data to be classified and the centroid vectors. The similarity between data x and centroid vector C_j is computed with the cosine measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
(3) Classify. The higher the similarity between a data item and a class, the more likely the item belongs to that class; the item is assigned to the class with the highest similarity: Class(x) = argmax_j Sim(x, C_j).
In the original CBC classification method, the classification boundary is the perpendicular bisector y of the line connecting the two class centroids a and b. When the extents of the two classes are similar, classification works well, as in Fig. 1; when the extents of the two classes differ greatly, the boundary y of the original method misclassifies the shaded area in Fig. 2. The present invention improves on the CBC classifier by adding a parameter k to each class, moving the boundary from y to y' to achieve an optimized result.
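The effect of k on the boundary can be seen in a tiny numeric example (the centroids, sample point, and weight value here are hypothetical, chosen only to illustrate the boundary shift from y to y'):

```python
import math

def cosine(x, c):
    dot = sum(a * b for a, b in zip(x, c))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c)))

a, b = (1.0, 0.0), (0.0, 1.0)   # two class centroids

def predict(x, ka=1.0, kb=1.0):
    """Assign x to the class maximizing k_j * Sim(x, C_j)."""
    return 'a' if ka * cosine(x, a) >= kb * cosine(x, b) else 'b'

x = (0.6, 0.8)                  # angularly closer to centroid b
print(predict(x))               # 'b': the unweighted boundary is the bisector
print(predict(x, ka=1.5))       # 'a': raising k_a moves the boundary toward b
```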
The training process of the parameter k:
1. Initialize k_j, j ∈ (1, p), where p is the number of classes in the data set S; construct the centroids, and compute the similarity between the data x to be classified and the centroid vector C_j with the cosine measure Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||); at the same time set a parameter k_j for each class C_j, initially k_1 = k_2 = ... = k_p = 1;
2. Classify each training data item x (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p); if the classification is correct, go to step 4, otherwise go to step 3;
3. The data item x has been misassigned to class C_j: increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation);
4. Repeat steps 2 and 3; when the classification accuracy fluctuates only within a set range, the final values of the parameters k_j are obtained.
In the present invention, each class has its own parameter k. During training, if a data item is misclassified, the parameters k_i and k_j of the two classes involved (the class C_i that the item really belongs to and the wrongly assigned class C_j) are adjusted, thereby adjusting the classification boundary between the classes and improving classification accuracy. Meanwhile, in the present invention the centroids remain unchanged; the position of the classification boundary is changed by adjusting the magnitude of the parameter k.
The concrete steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid: the AAC centroid is the arithmetic mean of all data in class C_j;
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Embodiment 2: experimental example
The improved CBC algorithm of the present invention is further described below through experiments on real training sets.
The general idea of the experiment is to evaluate both the original CBC method and the improved method on the training sets, using cross-validation for classification and assessment.
1. Experimental material and metrics:
The training sets are Ohsumed (23 classes, 13,929 samples in total) and Reuters-21578 (90 classes, 21,578 samples in total);
Experimental metrics: macro precision macro_p, macro recall macro_r, macro average macro_F1 (the mean of macro_p and macro_r), micro precision micro_p, micro recall micro_r, and micro_F1 (the mean of micro_p and micro_r); the higher these values, the better the classification performance;
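The macro metrics can be computed as sketched below. Note that macro_F1 is taken here as the arithmetic mean of macro_p and macro_r, following the definition used in this document (the conventional F1 is a harmonic mean); the function name and toy labels are illustrative.

```python
def macro_metrics(y_true, y_pred, classes):
    """Per-class precision/recall averaged over classes; macro_F1 is the
    arithmetic mean of macro_p and macro_r, as defined in this document."""
    ps, rs = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(ps) / len(ps)
    macro_r = sum(rs) / len(rs)
    return macro_p, macro_r, (macro_p + macro_r) / 2

m = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], [0, 1])
print(m)  # class 0: p=1.0, r=0.5; class 1: p=2/3, r=1.0
```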
2. Experimental procedure
(1) Randomly divide the training set into 5 parts; in turn use 4 parts for training and 1 part for validation, i.e. 5-fold cross-validation;
(2) Cross-validate the data with the original CBC classification;
(3) Cross-validate the data with the improved CBC classification;
(4) Compare the results of the two classifications.
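The 5-fold split of step (1) can be sketched as follows (the random seed, fold layout, and sample count are illustrative assumptions; training and evaluation of the two classifiers are indicated by comments only):

```python
import random

def five_fold(n, seed=0):
    """Randomly split n sample indices into 5 folds; each round uses
    4 folds for training and the remaining fold for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

folds = five_fold(20)
for i, val_fold in enumerate(folds):
    train_idx = [j for m, f in enumerate(folds) if m != i for j in f]
    # here one would train the original and the improved CBC on train_idx,
    # evaluate both on val_fold, and average the metrics over the 5 rounds
    print(i, len(train_idx), len(val_fold))
```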
3. Experimental results
The experimental results are shown in the tables below. As can be seen from the tables, the improved CBC classification outperforms the original CBC classification on both training sets. The experiments show that the improvement of the present invention over the original CBC classification is significant.
Reuters-21578 --- original CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.713 | 0.838 | 0.771 | 0.833 | 0.833 | 0.833
1 | 0.71 | 0.84 | 0.769 | 0.801 | 0.801 | 0.801
2 | 0.709 | 0.838 | 0.768 | 0.816 | 0.816 | 0.816
3 | 0.674 | 0.859 | 0.755 | 0.784 | 0.784 | 0.784
4 | 0.648 | 0.813 | 0.721 | 0.783 | 0.783 | 0.783
average | 0.691 | 0.838 | 0.757 | 0.803 | 0.803 | 0.803
Reuters-21578 --- improved CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.788 | 0.819 | 0.803 | 0.911 | 0.911 | 0.911
1 | 0.821 | 0.845 | 0.833 | 0.909 | 0.909 | 0.909
2 | 0.817 | 0.828 | 0.822 | 0.911 | 0.911 | 0.911
3 | 0.775 | 0.874 | 0.821 | 0.924 | 0.924 | 0.924
4 | 0.768 | 0.79 | 0.779 | 0.885 | 0.885 | 0.885
average | 0.794 | 0.831 | 0.812 | 0.908 | 0.908 | 0.908
Ohsumed --- original CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.758 | 0.758 | 0.758 | 0.761 | 0.761 | 0.761
1 | 0.769 | 0.767 | 0.768 | 0.77 | 0.77 | 0.77
2 | 0.774 | 0.775 | 0.774 | 0.777 | 0.777 | 0.777
3 | 0.745 | 0.749 | 0.747 | 0.754 | 0.754 | 0.754
4 | 0.743 | 0.742 | 0.743 | 0.744 | 0.744 | 0.744
average | 0.758 | 0.758 | 0.758 | 0.761 | 0.761 | 0.761
Ohsumed --- improved CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.76 | 0.76 | 0.76 | 0.767 | 0.767 | 0.767
1 | 0.778 | 0.777 | 0.777 | 0.785 | 0.785 | 0.785
2 | 0.777 | 0.777 | 0.779 | 0.789 | 0.789 | 0.789
3 | 0.748 | 0.752 | 0.75 | 0.765 | 0.765 | 0.765
4 | 0.751 | 0.749 | 0.75 | 0.758 | 0.758 | 0.758
average | 0.763 | 0.764 | 0.763 | 0.773 | 0.773 | 0.773
Main abbreviations:
CBC: centroid-based classification;
VSM: vector space model;
AAC: arithmetical average centroid.
Claims (2)
1. An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
2. The improved method for a centroid classifier according to claim 1, characterized in that:
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid: the centroid is the arithmetic mean of all data in class C_j, and the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510801697.7A CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510801697.7A CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105320968A true CN105320968A (en) | 2016-02-10 |
Family
ID=55248322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510801697.7A Pending CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105320968A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829807A (en) * | 2018-06-07 | 2018-11-16 | 武汉斗鱼网络科技有限公司 | A kind of public sentiment merging method, device, server and storage medium |
CN112214535A (en) * | 2020-10-22 | 2021-01-12 | 上海明略人工智能(集团)有限公司 | Similarity calculation method and system, electronic device and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160210