CN105320968A - Improved method for centroid classifier - Google Patents

Improved method for centroid classifier

Info

Publication number
CN105320968A
CN105320968A
Authority
CN
China
Prior art keywords
data
classification
parameter
centroid
class
Prior art date
Legal status
Pending
Application number
CN201510801697.7A
Other languages
Chinese (zh)
Inventor
刘川
汪文勇
夏守璐
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201510801697.7A priority Critical patent/CN105320968A/en
Publication of CN105320968A publication Critical patent/CN105320968A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an improved method for a centroid classifier. The method comprises the following steps: a. selecting a training set and a test set, and constructing the centroids; b. computing the similarity between the data to be classified and the centroid vectors; c. initializing the parameter k by setting a parameter k for each class, the position of the classification face being adjusted by adjusting the parameter k; d. training the parameter k; and e. applying the trained parameter k to the data to be classified. The improved method for the centroid classifier disclosed by the present invention improves the classification accuracy of the original CBC on unbalanced classes and can be used for data classification.

Description

Improved method for a centroid classifier
Technical field
The present invention relates to the field of computer science and technology, and in particular to data classification methods. It can be applied to the classification of Internet data and can improve the efficiency and accuracy of information retrieval.
Background technology
With the development of information technology, the amount of information available to people has grown explosively. Faced with this ever-increasing mass of information, it has become more and more difficult to obtain the data people need quickly and effectively by manual means alone. Automated tools are needed to help people better manage and filter this information, and CBC (the centroid-based classifier) is one of the better data classifiers.
Centroid-based classification is one of the classic classification techniques. Its idea is simple: a text is assigned to a class according to the similarity between its feature vector and the class centroid. Centroid-based classification is easy to understand and implement, its performance is stable and better than Naive Bayes, KNN, and the C4.5 decision tree, its complexity is linear in the size of the text collection, its classification efficiency is high, and the probability of overfitting is low.
For example, the paper "A document classification algorithm based on centroids", published in the journal Computer Engineering in October 2009 by Chai Limei, Zhu Guochong, Zan Hongying, et al., states in essence: "When the text collection is rather dispersed or multi-modal, the classification performance of the centroid-based document classification algorithm is very poor. An improved centroid-based document classification algorithm is proposed for this problem; compared with the classical centroid-based classification algorithm its performance is higher. Test results of the UCK algorithm on a text classification corpus provided by the Hong Kong Hui Kexun company show that the efficiency and precision of the algorithm meet the requirements."
However, the deficiency of the traditional centroid-based classification method represented by the paper reviewed above is also apparent: the centroid vector is computed from all the feature vectors of the labeled texts belonging to a class, so when the texts belonging to the same class are rather dispersed, or when classes overlap, its classification performance is poor.
Summary of the invention
The present invention aims at the defects and deficiencies of the above prior art and provides an improved method for a centroid classifier. The method adds a parameter to each class so that, without reducing classification efficiency, the classification performance is significantly improved, solving the problem that the original centroid classifier has low classification accuracy on unbalanced classes.
The present invention is realized by adopting the following technical solution:
An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and build the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k by setting a parameter k for each class; the position of the classification face is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR containing p classes and a data set D_TE to be classified, build the centroids: using the vector space model (VSM), represent each data item as a corresponding vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the individual words after the text has been vectorized, and construct, by the Arithmetical Average Centroid (AAC) method, a vector that can represent a given class of data, i.e. the centroid. The AAC centroid is the arithmetic mean of all the data in class C_j; the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
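As an illustration of step a, the following is a minimal sketch of AAC centroid construction, assuming the documents have already been turned into normalized VSM (e.g. TF-IDF) vectors; the function and variable names are illustrative and not part of the patent:

```python
import numpy as np

def build_centroids(X, y):
    """Arithmetical Average Centroid (AAC): the centroid of each class C_j is
    the arithmetic mean of all sample vectors belonging to that class."""
    centroids = {}
    for label in np.unique(y):
        class_vectors = X[y == label]                   # all samples x_i in class C_j
        centroids[label] = class_vectors.mean(axis=0)   # C_j = (1/S) * sum(x_i)
    return centroids
```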
Step b specifically refers to:
The similarity between a data item x to be classified and a centroid vector C_j is computed with the cosine similarity measure:
Sim(x, C_j) = (x · C_j) / (|x| * |C_j|);
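A minimal sketch of the cosine similarity of step b, under the same illustrative naming (dense numpy vectors are assumed):

```python
import numpy as np

def cosine_similarity(x, c):
    """Sim(x, C_j) = (x . C_j) / (|x| * |C_j|)."""
    norm = np.linalg.norm(x) * np.linalg.norm(c)
    return float(np.dot(x, c) / norm) if norm > 0 else 0.0
```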
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the data in the training set continues;
2) If a data item of class C_i is wrongly assigned to class C_j, increase k_i and decrease k_j (the fixed amount by which k_i and k_j are increased or decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the data in the training set;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates within a given range, and obtain the final values of the parameters k_j (a sketch of this training loop is given below);
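A minimal sketch of the training loop of step d, assuming the illustrative helpers sketched above; `delta` is an illustrative fixed adjustment step, and a fixed number of passes stands in for the patent's stopping criterion of the accuracy fluctuating within a given range:

```python
def train_k(X, y, centroids, delta=0.01, n_passes=20):
    """Learn one multiplier k_j per class by nudging k on every misclassification."""
    labels = list(centroids)
    k = {label: 1.0 for label in labels}                  # k_1 = k_2 = ... = k_p = 1
    for _ in range(n_passes):                             # stand-in stopping criterion
        for x, true_label in zip(X, y):
            sims = {c: k[c] * cosine_similarity(x, centroids[c]) for c in labels}
            predicted = max(sims, key=sims.get)            # Class(x) = argmax k_j * Sim(x, C_j)
            if predicted != true_label:                    # x of class C_i misassigned to C_j
                k[true_label] += delta                     # increase k_i
                k[predicted] -= delta                      # decrease k_j
    return k
```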
Step e specifically refers to:
Classify the data set D_TE to be classified according to Class(x) = argmax_j (k_j * Sim(x, C_j)).
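A minimal sketch of step e under the same illustrative naming, applying the trained parameters k to new data:

```python
def classify(x, centroids, k):
    """Assign x to the class that maximizes k_j * Sim(x, C_j)."""
    return max(centroids, key=lambda c: k[c] * cosine_similarity(x, centroids[c]))
```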
Compared with the prior art, the beneficial effects achieved by the present invention are as follows:
Through steps a-e of the present invention, the method retains the linear character of the CBC classification method, so its complexity is low and its classification speed is fast; it improves the accuracy of the CBC classification method; and since the centroids of the present invention remain unchanged and the classification face is adjusted through the parameter k, the phenomenon of centroid drift does not occur.
Description of the drawings
The present invention is described in further detail below in conjunction with the drawings and the specific embodiments, wherein:
Fig. 1 shows the classification principle of the original CBC;
Fig. 2 shows the classification principle of the improved CBC of the present invention.
Detailed description of the embodiments
Embodiment 1
The main purpose of the present invention is to improve the centroid-based classification method (CBC), making up for the deficiencies of the original centroid classification method so that the classification results are better.
The original CBC classification process:
(1) Build the centroids. Using the vector space model (VSM), represent each data item as a corresponding vector x = (w_1, w_2, ..., w_n), and construct, by a method such as AAC, a vector that can represent a given class of data, i.e. the centroid. With the Arithmetical Average Centroid (AAC) method, the centroid is the arithmetic mean of all the data in class C_j; the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
(2) Compute the similarity between the data to be classified and the centroid vectors. The similarity between a data item x to be classified and a centroid vector C_j is computed with the cosine similarity measure Sim(x, C_j) = (x · C_j) / (|x| * |C_j|);
(3) Classify. The higher the similarity between a data item and a class, the more likely the data item belongs to that class, so the data item is assigned to the class with the highest similarity: Class(x) = argmax_j Sim(x, C_j).
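A minimal end-to-end sketch of this original CBC process, reusing the illustrative helpers sketched above; the original CBC is the special case in which every k_j equals 1:

```python
def cbc_classify(x, centroids):
    """Original CBC: assign x to the class whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda c: cosine_similarity(x, centroids[c]))
```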
In the original CBC classification method, the classification face is the perpendicular bisector y of the line connecting the centroids a and b of two classes. When the extents of the two classes are roughly the same, the classification performance is good, as in Fig. 1; when the extents of the two classes differ greatly, the shaded area in Fig. 2 is misclassified by the classification face y of the original CBC method. The present invention improves on the CBC classifier by adding a parameter k to each class, which moves the classification face y to y' (the new face consists of the points where k_a * Sim(x, C_a) = k_b * Sim(x, C_b)) and thereby achieves an optimizing effect.
The training process of the parameter k:
1. Initialize k_j, j ∈ (1, p) (p is the number of classes in the data set S); build the centroids and compute the similarity between a data item x to be classified and each centroid vector C_j with the cosine similarity measure Sim(x, C_j) = (x · C_j) / (|x| * |C_j|); at the same time set a parameter k_j for each class C_j, initially k_1 = k_2 = ... = k_p = 1;
2. Classify each training data item x (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p); if the classification is correct go to step 4, otherwise go to step 3;
3. If the data item x is wrongly assigned to class C_j, increase k_i and decrease k_j (the fixed amount by which k_i and k_j are increased or decreased can be set according to the actual situation);
4. Repeat steps 2 and 3; when the classification accuracy fluctuates within a given range, the final value of each parameter k_j is obtained.
In the present invention, each class has its own parameter k. During training, if a data item is misclassified, the parameters k_i and k_j of the two classes involved (namely the class C_i the data item really belongs to and the class C_j it was wrongly assigned to) are adjusted, so that the classification face between the classes is adjusted and the classification accuracy is improved. Meanwhile, in the present invention the centroids remain unchanged, and the position of the classification face is changed by adjusting the size of the parameter k.
The concrete steps are as follows:
a. Select a training set and a test set, and build the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k by setting a parameter k for each class; the position of the classification face is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR containing p classes and a data set D_TE to be classified, build the centroids: using the vector space model (VSM), represent each data item as a corresponding vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the individual words after the text has been vectorized, and construct, by the Arithmetical Average Centroid (AAC) method, a vector that can represent a given class of data, i.e. the centroid; the AAC centroid is the arithmetic mean of all the data in class C_j;
Step b specifically refers to:
The similarity between a data item x to be classified and a centroid vector C_j is computed with the cosine similarity measure:
Sim(x, C_j) = (x · C_j) / (|x| * |C_j|);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the data in the training set continues;
2) If a data item of class C_i is wrongly assigned to class C_j, increase k_i and decrease k_j (the fixed amount by which k_i and k_j are increased or decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the data in the training set;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates within a given range, and obtain the final values of the parameters k_j;
Step e specifically refers to:
Classify the data set D_TE to be classified according to Class(x) = argmax_j (k_j * Sim(x, C_j)).
Embodiment 2: Experimental example
The invention is further described below through experiments with the improved CBC algorithm on real training sets:
The general idea of the experiments is to test both the original CBC method and the improved method on the training sets, and to classify and evaluate each training set using cross-validation.
1. Experimental material and metrics:
The training sets are Ohsumed (23 classes, 13929 samples in total) and Reuters-21578 (90 classes, 21578 samples in total);
Experimental metrics: macro precision macro_p, macro recall macro_r, macro average macro_F1 (the average of macro_p and macro_r), micro precision micro_p, micro recall micro_r, and micro_F1 (the average of micro_p and micro_r); the higher these values are, the better the classification performance;
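A minimal sketch of these metrics as defined above (macro_F1 and micro_F1 taken as the plain average of the corresponding precision and recall, following the definition given here); the function and variable names are illustrative:

```python
import numpy as np

def macro_micro_metrics(y_true, y_pred, labels):
    """Macro/micro precision and recall; F1 is the plain mean of precision and recall,
    as defined in the experiments above."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    macro_p, macro_r = float(np.mean(precisions)), float(np.mean(recalls))
    micro_p = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    micro_r = sum(tp.values()) / (sum(tp.values()) + sum(fn.values()))
    return {"macro_p": macro_p, "macro_r": macro_r, "macro_F1": (macro_p + macro_r) / 2,
            "micro_p": micro_p, "micro_r": micro_r, "micro_F1": (micro_p + micro_r) / 2}
```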
2. Experimental procedure
(1) Randomly divide the training set into 5 parts; in turn, use 4 parts for training and 1 part for validation, i.e. cross-validation;
(2) Perform cross-validation on the data with the original CBC classification;
(3) Perform cross-validation on the data with the improved CBC classification;
(4) Compare the results of the two classification methods (a sketch of this procedure is given below);
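A minimal sketch of this 5-fold procedure, assuming the illustrative helpers sketched earlier (`build_centroids`, `train_k`, `classify`, `macro_micro_metrics`) and numpy arrays for the data and labels:

```python
import numpy as np

def cross_validate(X, y, n_folds=5, improved=True, seed=0):
    """5-fold cross-validation of the original (improved=False) or improved CBC."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    labels = sorted(set(y))
    results = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        centroids = build_centroids(X[train_idx], y[train_idx])
        if improved:
            k = train_k(X[train_idx], y[train_idx], centroids)   # improved CBC
        else:
            k = {c: 1.0 for c in labels}                          # original CBC: all k_j = 1
        y_pred = [classify(x, centroids, k) for x in X[test_idx]]
        results.append(macro_micro_metrics(y[test_idx], y_pred, labels))
    return results
```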
3. Experimental results
The experimental results are shown in the tables below. As can be seen from the tables, the improved CBC classification outperforms the original CBC classification on both training sets. The experiments demonstrate that the present invention improves significantly on the original CBC classification.
Reuters-21578---original CBC classification results
fold macro_p macro_r macro_F1 micro_p micro_r micro_F1
0 0.713 0.838 0.771 0.833 0.833 0.833
1 0.71 0.84 0.769 0.801 0.801 0.801
2 0.709 0.838 0.768 0.816 0.816 0.816
3 0.674 0.859 0.755 0.784 0.784 0.784
4 0.648 0.813 0.721 0.783 0.783 0.783
average 0.691 0.838 0.757 0.803 0.803 0.803
Reuters-21578---improved CBC classification results
fold macro_p macro_r macro_F1 micro_p micro_r micro_F1
0 0.788 0.819 0.803 0.911 0.911 0.911
1 0.821 0.845 0.833 0.909 0.909 0.909
2 0.817 0.828 0.822 0.911 0.911 0.911
3 0.775 0.874 0.821 0.924 0.924 0.924
4 0.768 0.79 0.779 0.885 0.885 0.885
average 0.794 0.831 0.812 0.908 0.908 0.908
Ohsumed---original CBC classification results
fold macro_p macro_r macro_F1 micro_p micro_r micro_F1
0 0.758 0.758 0.758 0.761 0.761 0.761
1 0.769 0.767 0.768 0.77 0.77 0.77
2 0.774 0.775 0.774 0.777 0.777 0.777
3 0.745 0.749 0.747 0.754 0.754 0.754
4 0.743 0.742 0.743 0.744 0.744 0.744
average 0.758 0.758 0.758 0.761 0.761 0.761
Ohsumed---improved CBC classification results
fold macro_p macro_r macro_F1 micro_p micro_r micro_F1
0 0.76 0.76 0.76 0.767 0.767 0.767
1 0.778 0.777 0.777 0.785 0.785 0.785
2 0.777 0.777 0.779 0.789 0.789 0.789
3 0.748 0.752 0.75 0.765 0.765 0.765
4 0.751 0.749 0.75 0.758 0.758 0.758
average 0.763 0.764 0.763 0.773 0.773 0.773
Main abbreviations:
CBC: centroid-based classification;
VSM: vector space model;
AAC: arithmetical average centroid.

Claims (2)

1. An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and build the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k by setting a parameter k for each class; the position of the classification face is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
2. The improved method for a centroid classifier according to claim 1, characterized in that:
Step a specifically refers to:
Given a training set D_TR containing p classes and a data set D_TE to be classified, build the centroids: using the vector space model (VSM), represent each data item as a corresponding vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the individual words after the text has been vectorized, and construct, by the Arithmetical Average Centroid (AAC) method, a vector that can represent a given class of data, i.e. the centroid; the centroid is the arithmetic mean of all the data in class C_j, and the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
Step b specifically refers to:
The similarity between a data item x to be classified and a centroid vector C_j is computed with the cosine similarity measure:
Sim(x, C_j) = (x · C_j) / (|x| * |C_j|);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the data in the training set continues;
2) If a data item of class C_i is wrongly assigned to class C_j, increase k_i and decrease k_j (the fixed amount by which k_i and k_j are increased or decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the data in the training set;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates within a given range, and obtain the final values of the parameters k_j;
Step e specifically refers to:
Classify the data set D_TE to be classified according to Class(x) = argmax_j (k_j * Sim(x, C_j)).
CN201510801697.7A 2015-11-19 2015-11-19 Improved method for centroid classifier Pending CN105320968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510801697.7A CN105320968A (en) 2015-11-19 2015-11-19 Improved method for centroid classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510801697.7A CN105320968A (en) 2015-11-19 2015-11-19 Improved method for centroid classifier

Publications (1)

Publication Number Publication Date
CN105320968A 2016-02-10

Family

ID=55248322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510801697.7A Pending CN105320968A (en) 2015-11-19 2015-11-19 Improved method for centroid classifier

Country Status (1)

Country Link
CN (1) CN105320968A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829807A (en) * 2018-06-07 2018-11-16 武汉斗鱼网络科技有限公司 A kind of public sentiment merging method, device, server and storage medium
CN112214535A (en) * 2020-10-22 2021-01-12 上海明略人工智能(集团)有限公司 Similarity calculation method and system, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN105045812B (en) The classification method and system of text subject
CN106445919A (en) Sentiment classifying method and device
CN108108351A (en) A kind of text sentiment classification method based on deep learning built-up pattern
CN107451278A (en) Chinese Text Categorization based on more hidden layer extreme learning machines
CN106599054A (en) Method and system for title classification and push
CN103020167B (en) A kind of computer Chinese file classification method
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
CN103365997A (en) Opinion mining method based on ensemble learning
CN105205124A (en) Semi-supervised text sentiment classification method based on random feature subspace
CN100412869C (en) Improved file similarity measure method based on file structure
CN106095791A (en) A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof
CN101882136A (en) Method for analyzing emotion tendentiousness of text
CN108664633A (en) A method of carrying out text classification using diversified text feature
CN104199829A (en) Emotion data classifying method and system
CN104699685A (en) Model updating device and method, data processing device and method, program
CN102693321A (en) Cross-media information analysis and retrieval method
CN107292348A (en) A kind of Bagging_BSJ short text classification methods
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN105320968A (en) Improved method for centroid classifier
CN105550292B (en) A kind of Web page classification method based on von Mises-Fisher probabilistic models
CN110674293B (en) Text classification method based on semantic migration
CN108268458A (en) A kind of semi-structured data sorting technique and device based on KNN algorithms
CN101727463A (en) Text training method and text classifying method
Dong et al. The research of kNN text categorization algorithm based on eager learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160210
