CN105320968A - Improved method for centroid classifier - Google Patents
Improved method for centroid classifier
- Publication number
- CN105320968A (application CN201510801697.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- classification
- parameter
- barycenter
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses an improved method for a centroid classifier. The method comprises the following steps: a. selecting a training set and a test set, and constructing the centroids; b. computing the similarity between the data to be classified and the centroid vectors; c. initializing a parameter k for each class, where adjusting k adjusts the position of the classification boundary; d. training the parameter k; and e. applying the trained parameter k to the data to be classified. The improved method for the centroid classifier disclosed by the present invention raises the classification accuracy of the original CBC (centroid-based classifier) on imbalanced classes and can be used for data classification.
Description
Technical field
The present invention relates to the field of computer science and technology, in particular to data classification methods. It can be applied to the classification of Internet data and can improve the efficiency and accuracy of information retrieval.
Background technology
With the development of information technology, the amount of information available to people has grown explosively. Faced with this ever-increasing mass of information, obtaining the required data quickly and effectively by manual means alone has become more and more difficult. Automated tools are needed to help people manage and filter this information, and the CBC (centroid-based classifier) is one of the better data classifiers.
Centroid-based classification is one of the classic classification techniques. Its idea is simple: a text is assigned to a class according to the similarity between its feature vector and the class centroid. Centroid-based classification is easy to understand and implement, performs stably, often outperforming Naive Bayes, KNN, and the C4.5 decision tree, and its algorithmic complexity is linear in the size of the text collection, so classification is efficient and overfitting is unlikely.
For example, in October 2009 Chai Yumei, Zhu Guochong, Zan Hongying, and other authors published a paper entitled "A text classification algorithm based on centroids" in the journal Computer Engineering. The gist of the paper is that when the text collection is widely dispersed or multi-modal, the classification performance of the centroid-based algorithm is poor. For this problem it proposes an improved centroid-based text classification algorithm whose performance is higher than that of the classical centroid-based algorithm. Test results of the UCK algorithm on a text classification corpus provided by the Hong Kong Hui Kexun company show that the efficiency and precision of the algorithm meet requirements.
However, the deficiency of the traditional centroid-based classification method represented by the paper above is also apparent: the centroid vector is computed from all the feature vectors of the labeled texts belonging to a class, so when the texts of one class are widely dispersed, or when classes overlap, its classification performance is poor.
Summary of the invention
The present invention addresses the above defects and deficiencies of the prior art and provides an improved method for a centroid classifier. The method adds a parameter to each class; without reducing classification efficiency, it significantly improves classification performance and solves the problem that the original centroid classifier has low classification accuracy on imbalanced classes.
The present invention is realized by adopting the following technical scheme:
An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid. The AAC centroid is the arithmetic mean of all data in class C_j; the centroid vector C_j is computed as

C_j = (1/S) * Σ_{x_i ∈ C_j} x_i

where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
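The AAC centroid construction of step a can be sketched in Python as follows. This is an illustration, not part of the patent; the function name `build_centroids` and the toy term-weight vectors are assumptions.

```python
import numpy as np

def build_centroids(X, y):
    """Arithmetical average centroid (AAC): the centroid of each class
    is the arithmetic mean of the sample vectors belonging to it,
    i.e. C_j = (1/S) * sum of x_i over x_i in C_j."""
    centroids = {}
    for label in np.unique(y):
        members = X[y == label]                   # all samples of class C_j
        centroids[label] = members.mean(axis=0)   # arithmetic mean -> centroid
    return centroids

# toy example: four documents in a 3-dimensional term-weight space
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([0, 0, 1, 1])
cents = build_centroids(X, y)
print(cents[0])  # mean of the first two rows: [0.5 0.5 0. ]
```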
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure:

Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
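The cosine measure of step b, as a minimal self-contained sketch (the function name and example vectors are illustrative):

```python
import math

def cosine_sim(x, c):
    """Cosine similarity Sim(x, C_j) = (x . C_j) / (||x|| * ||C_j||)."""
    dot = sum(a * b for a, b in zip(x, c))
    nx = math.sqrt(sum(a * a for a in x))
    nc = math.sqrt(sum(b * b for b in c))
    return dot / (nx * nc)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```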
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
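The training loop of step d can be sketched as follows. This is a hedged reconstruction: the step size `delta`, the stopping tolerance `tol`, the epoch cap, and the toy data are all assumptions chosen for illustration, since the patent leaves the fixed increment/decrement and the accuracy range to be set according to the actual situation.

```python
import math

def cosine(x, c):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, c))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c)))

def train_k(train, centroids, sim, delta=0.2, max_epochs=50, tol=1e-3):
    """Train one multiplier k_j per class. On each misclassification of a
    sample of true class i into class j, k_i is increased and k_j decreased
    by the fixed step delta; training stops once the training accuracy
    fluctuates by less than tol between epochs."""
    k = {c: 1.0 for c in centroids}              # k_1 = ... = k_p = 1
    prev_acc = None
    for _ in range(max_epochs):
        correct = 0
        for x, true_c in train:
            pred = max(centroids, key=lambda j: k[j] * sim(x, centroids[j]))
            if pred == true_c:
                correct += 1
            else:                                # misclassified: adjust both classes
                k[true_c] += delta
                k[pred] -= delta
        acc = correct / len(train)
        if prev_acc is not None and abs(acc - prev_acc) < tol:
            break                                # accuracy has stabilized
        prev_acc = acc
    return k

# toy data: the third sample belongs to class 0 but is initially
# pulled toward the class-1 centroid, so k must shift the boundary
centroids = {0: (1.0, 0.0), 1: (0.0, 1.0)}
train = [((1.0, 0.0), 0), ((0.0, 1.0), 1), ((0.6, 0.8), 0)]
k = train_k(train, centroids, cosine)
print(k)  # k[0] has grown above 1, k[1] has shrunk below 1
```

After one adjustment, the misclassified sample satisfies k_0 * Sim(x, C_0) > k_1 * Sim(x, C_1) and training converges with the centroids themselves unchanged, matching the invention's claim that only the boundary moves.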
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Compared with the prior art, the beneficial effects achieved by the present invention are as follows:
Through steps a-e, the present invention retains the linear character of the CBC classification method: complexity is low and classification is fast. It improves the accuracy of the CBC classification method. The centroids of the present invention remain unchanged; the classification boundary is adjusted through the parameter k, so centroid drift does not occur.
Description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments, wherein:
Fig. 1 illustrates the classification principle of the original CBC;
Fig. 2 illustrates the classification principle of the improved CBC of the present invention.
Detailed description of the embodiments
Embodiment 1
The main purpose of the present invention is to improve the centroid-based classification (CBC) method, make up for the deficiencies of the original centroid-based method, and obtain better classification results.
The original CBC classification process:
(1) Construct the centroids. Using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), and construct, by methods such as AAC, a vector that represents a class of data, i.e. the centroid. The arithmetical average centroid (AAC) is the arithmetic mean of all data in class C_j; the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
(2) Compute the similarity between the data to be classified and the centroid vectors. The similarity between data x and centroid vector C_j is computed with the cosine measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
(3) Classify. The higher the similarity between a data item and a class, the more likely the item belongs to that class; the item is assigned to the class with the highest similarity: Class(x) = argmax_j Sim(x, C_j).
In the original CBC classification method, the classification boundary is the perpendicular bisector y of the line connecting the two class centroids a and b. When the extents of the two classes are similar, classification works well, as in Fig. 1; when the extents of the two classes differ greatly, the boundary y of the original method misclassifies the shaded area in Fig. 2. The present invention improves on the CBC classifier by adding a parameter k to each class, moving the boundary from y to y' to achieve an optimized result.
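The effect of k on the boundary can be seen in a tiny numeric example (the centroids, sample point, and weight value here are hypothetical, chosen only to illustrate the boundary shift from y to y'):

```python
import math

def cosine(x, c):
    dot = sum(a * b for a, b in zip(x, c))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c)))

a, b = (1.0, 0.0), (0.0, 1.0)   # two class centroids

def predict(x, ka=1.0, kb=1.0):
    """Assign x to the class maximizing k_j * Sim(x, C_j)."""
    return 'a' if ka * cosine(x, a) >= kb * cosine(x, b) else 'b'

x = (0.6, 0.8)                  # angularly closer to centroid b
print(predict(x))               # 'b': the unweighted boundary is the bisector
print(predict(x, ka=1.5))       # 'a': raising k_a moves the boundary toward b
```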
The training process of the parameter k:
1. Initialize k_j, j ∈ (1, p), where p is the number of classes in the data set S; construct the centroids, and compute the similarity between the data x to be classified and the centroid vector C_j with the cosine measure Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||); at the same time set a parameter k_j for each class C_j, initially k_1 = k_2 = ... = k_p = 1;
2. Classify each training data item x (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p); if the classification is correct, go to step 4, otherwise go to step 3;
3. The data item x has been misassigned to class C_j: increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation);
4. Repeat steps 2 and 3; when the classification accuracy fluctuates only within a set range, the final values of the parameters k_j are obtained.
In the present invention, each class has its own parameter k. During training, if a data item is misclassified, the parameters k_i and k_j of the two classes involved (the class C_i that the item really belongs to and the wrongly assigned class C_j) are adjusted, thereby adjusting the classification boundary between the classes and improving classification accuracy. Meanwhile, in the present invention the centroids remain unchanged; the position of the classification boundary is changed by adjusting the magnitude of the parameter k.
The concrete steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid: the AAC centroid is the arithmetic mean of all data in class C_j;
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Embodiment 2: experimental example
The improved CBC algorithm of the present invention is further described below through experiments on real training sets.
The general idea of the experiment is to evaluate both the original CBC method and the improved method on the training sets, using cross-validation for classification and assessment.
1. Experimental material and metrics:
The training sets are Ohsumed (23 classes, 13,929 samples in total) and Reuters-21578 (90 classes, 21,578 samples in total);
Experimental metrics: macro precision macro_p, macro recall macro_r, macro average macro_F1 (the mean of macro_p and macro_r), micro precision micro_p, micro recall micro_r, and micro_F1 (the mean of micro_p and micro_r); the higher these values, the better the classification performance;
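The macro metrics can be computed as sketched below. Note that macro_F1 is taken here as the arithmetic mean of macro_p and macro_r, following the definition used in this document (the conventional F1 is a harmonic mean); the function name and toy labels are illustrative.

```python
def macro_metrics(y_true, y_pred, classes):
    """Per-class precision/recall averaged over classes; macro_F1 is the
    arithmetic mean of macro_p and macro_r, as defined in this document."""
    ps, rs = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(ps) / len(ps)
    macro_r = sum(rs) / len(rs)
    return macro_p, macro_r, (macro_p + macro_r) / 2

m = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], [0, 1])
print(m)  # class 0: p=1.0, r=0.5; class 1: p=2/3, r=1.0
```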
2. Experimental procedure
(1) Randomly divide the training set into 5 parts; in turn use 4 parts for training and 1 part for validation, i.e. 5-fold cross-validation;
(2) Cross-validate the data with the original CBC classification;
(3) Cross-validate the data with the improved CBC classification;
(4) Compare the results of the two classifications.
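The 5-fold split of step (1) can be sketched as follows (the random seed, fold layout, and sample count are illustrative assumptions; training and evaluation of the two classifiers are indicated by comments only):

```python
import random

def five_fold(n, seed=0):
    """Randomly split n sample indices into 5 folds; each round uses
    4 folds for training and the remaining fold for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

folds = five_fold(20)
for i, val_fold in enumerate(folds):
    train_idx = [j for m, f in enumerate(folds) if m != i for j in f]
    # here one would train the original and the improved CBC on train_idx,
    # evaluate both on val_fold, and average the metrics over the 5 rounds
    print(i, len(train_idx), len(val_fold))
```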
3. Experimental results
The experimental results are shown in the tables below. As can be seen from the tables, the improved CBC classification outperforms the original CBC classification on both training sets. The experiments show that the improvement of the present invention over the original CBC classification is significant.
Reuters-21578 --- original CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.713 | 0.838 | 0.771 | 0.833 | 0.833 | 0.833
1 | 0.71 | 0.84 | 0.769 | 0.801 | 0.801 | 0.801
2 | 0.709 | 0.838 | 0.768 | 0.816 | 0.816 | 0.816
3 | 0.674 | 0.859 | 0.755 | 0.784 | 0.784 | 0.784
4 | 0.648 | 0.813 | 0.721 | 0.783 | 0.783 | 0.783
average | 0.691 | 0.838 | 0.757 | 0.803 | 0.803 | 0.803
Reuters-21578 --- improved CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.788 | 0.819 | 0.803 | 0.911 | 0.911 | 0.911
1 | 0.821 | 0.845 | 0.833 | 0.909 | 0.909 | 0.909
2 | 0.817 | 0.828 | 0.822 | 0.911 | 0.911 | 0.911
3 | 0.775 | 0.874 | 0.821 | 0.924 | 0.924 | 0.924
4 | 0.768 | 0.79 | 0.779 | 0.885 | 0.885 | 0.885
average | 0.794 | 0.831 | 0.812 | 0.908 | 0.908 | 0.908
Ohsumed --- original CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.758 | 0.758 | 0.758 | 0.761 | 0.761 | 0.761
1 | 0.769 | 0.767 | 0.768 | 0.77 | 0.77 | 0.77
2 | 0.774 | 0.775 | 0.774 | 0.777 | 0.777 | 0.777
3 | 0.745 | 0.749 | 0.747 | 0.754 | 0.754 | 0.754
4 | 0.743 | 0.742 | 0.743 | 0.744 | 0.744 | 0.744
average | 0.758 | 0.758 | 0.758 | 0.761 | 0.761 | 0.761
Ohsumed --- improved CBC classification results

fold | macro_p | macro_r | macro_F1 | micro_p | micro_r | micro_F1
---|---|---|---|---|---|---
0 | 0.76 | 0.76 | 0.76 | 0.767 | 0.767 | 0.767
1 | 0.778 | 0.777 | 0.777 | 0.785 | 0.785 | 0.785
2 | 0.777 | 0.777 | 0.779 | 0.789 | 0.789 | 0.789
3 | 0.748 | 0.752 | 0.75 | 0.765 | 0.765 | 0.765
4 | 0.751 | 0.749 | 0.75 | 0.758 | 0.758 | 0.758
average | 0.763 | 0.764 | 0.763 | 0.773 | 0.773 | 0.773
Main abbreviations:
CBC: centroid-based classification;
VSM: vector space model;
AAC: arithmetical average centroid.
Claims (2)
1. An improved method for a centroid classifier, characterized in that the steps are as follows:
a. Select a training set and a test set, and construct the centroids;
b. Compute the similarity between the data to be classified and the centroid vectors;
c. Initialize the parameter k: set a parameter k for each class; the position of the classification boundary is adjusted by adjusting the parameter k;
d. Train the parameter k;
e. Apply the trained parameter k to the data to be classified.
2. The improved method for a centroid classifier according to claim 1, characterized in that:
Step a specifically refers to:
Given a training set D_TR with p known classes and a data set D_TE to be classified, construct the centroids: using the vector space model (VSM), represent each data item as a vector x = (w_1, w_2, ..., w_n), where w_1, w_2, ..., w_n are the normalized weights of the words obtained by vectorizing the text. Then construct, by the arithmetical average centroid (AAC) method, a vector that represents a class of data, i.e. the centroid: the centroid is the arithmetic mean of all data in class C_j, and the centroid vector C_j is computed as C_j = (1/S) * Σ_{x_i ∈ C_j} x_i, where S is the number of samples in class C_j and x_i is the vector of a sample in class C_j;
Step b specifically refers to:
Compute the similarity between the data x to be classified and the centroid vector C_j using the cosine similarity measure: Sim(x, C_j) = (x · C_j) / (||x|| · ||C_j||);
Step c specifically refers to:
Initialize the parameter k: set a parameter k_j for each class C_j, j ∈ (1, p); initially k_1 = k_2 = ... = k_p = 1;
Step d specifically refers to:
Train the parameter k: classify each data item x in the training set (x ∈ C_i, i ∈ (1, p)) according to Class(x) = argmax_j (k_j * Sim(x, C_j)), j ∈ (1, p):
1) If the classification is correct, the parameters k are not modified, and classification of the training data continues;
2) If a data item of class C_i is misassigned to class C_j, increase k_i and decrease k_j (the fixed amounts by which k_i and k_j are increased and decreased can be set according to the actual situation); after updating k_i and k_j, continue classifying the training data;
3) Repeat steps 1) and 2); stop training when the classification accuracy fluctuates only within a set range, and take the resulting values of the parameters k_j as final;
Step e specifically refers to:
According to Class(x) = argmax_j (k_j * Sim(x, C_j)), classify the data set D_TE.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510801697.7A CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510801697.7A CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105320968A true CN105320968A (en) | 2016-02-10 |
Family
ID=55248322
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510801697.7A Pending CN105320968A (en) | 2015-11-19 | 2015-11-19 | Improved method for centroid classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105320968A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829807A (en) * | 2018-06-07 | 2018-11-16 | 武汉斗鱼网络科技有限公司 | A kind of public sentiment merging method, device, server and storage medium |
CN112214535A (en) * | 2020-10-22 | 2021-01-12 | 上海明略人工智能(集团)有限公司 | Similarity calculation method and system, electronic device and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 | Publication |
| PB01 | Publication |
| C10 | Entry into substantive examination |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160210