CN102799902A - Enhanced relationship classifier based on representative samples - Google Patents
- Publication number: CN102799902A
- Application number: CN201210287636XA
- Authority: CN (China)
- Prior art keywords: sample, membership, cluster, degree
- Prior art date: 2012-08-13
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification landscape: Information Retrieval, Db Structures And Fs Structures Therefor
Abstract
The invention relates to an enhanced relational classifier based on representative samples. The method comprises two main steps: first, representative samples are selected according to the cluster memberships of the samples to form a new training sample set Xnew; then, a fuzzy relation matrix R is constructed with a φ-composition operator from the cluster memberships and class memberships of Xnew. The enhanced relational classifier has three main characteristics: (1) the matrix R can reveal the inherent logical relationship between clusters and classes; (2) the computational complexity of constructing R decreases from O(NLc) to O(MLc), where L is the number of classes, c is the number of clusters, N is the number of samples in the original dataset X, M is the number of samples in Xnew, and N > M; and (3) when some region of the sample space lacks sufficient discriminant information, the classifier refuses to make a decision for test samples falling into that region, thereby guaranteeing the confidence level of the classification results.
Description
Technical field
The invention belongs to the field of pattern recognition, and in particular relates to a relational classifier based on cluster analysis.
Background technology
The main task of pattern recognition is to process and analyze the various forms of information that characterize objects or phenomena, so as to classify (or group) and interpret them. Traditional pattern recognition comprises two important research themes: unsupervised clustering and supervised classification.
Supervised classification aims to design a class discriminant function from given data and their class labels, so that the classes of unknown samples can be predicted correctly. These methods focus on the class membership of samples and can achieve relatively good generalization to unseen samples. However, they emphasize only the classification of individual samples, while ignoring the mining of the structural knowledge hidden in the sample space and the characterization of the mutual relationships between samples, which makes the classification results less interpretable and less transparent. Typical methods include neural networks and the support vector machine (SVM). Unsupervised clustering aims to exploit the similarity between samples to assign samples with the same characteristics to the same meaningful cluster, thereby discovering the latent distribution structure of the samples and enabling a better understanding and analysis of the data. These methods can reveal the structural distribution of the data, but they cannot decide the class membership of a sample.
Each of these two classes of methods has its own advantages and disadvantages, so designing a method that combines the advantages of both while overcoming their shortcomings is an important research topic. Around this problem, researchers have proposed a series of methods. Viewed from the design flow, these methods all first use a clustering algorithm to mine the intrinsic structure of the data, and then use the obtained structure to design a classification mechanism. The radial basis function neural network (RBFNN) is a typical "unsupervised clustering + classifier design" approach. RBFNN first uses an unsupervised clustering algorithm such as C-means or Fuzzy C-means to determine the hidden-node parameters, and then optimizes the connection weights between the hidden layer and the output layer with the mean squared error (MSE) criterion between the actual output and the target output. The unsupervised clustering here is used only to determine the complexity and parameters of the network; it is merely an auxiliary tool of network design and cannot really reveal the intrinsic structure of the data. Therefore, RBFNN does not truly fuse the respective advantages of clustering learning and classification learning. Learning vector quantization (LVQ) uses the LVQ clustering algorithm to obtain the positions of the center points (the codebook) and their class information, and realizes classification with the 1-nearest-neighbor (1NN) rule based on these centers. In fact, none of these algorithms performs real training at the classifier-design stage; in other words, they do not carry out a real classifier design.
The fuzzy relational classifier (FRC) truly realizes the complementary combination of unsupervised clustering and supervised classification. FRC links clustering and classification by constructing a fuzzy logical relation between clusters and classes, achieving transparency and interpretability of the classification results. FRC has two significant advantages: (1) it constructs the fuzzy relation matrix with an implication operator, thereby revealing the inherent logical relationship between clusters and classes; (2) when some region of the sample space lacks sufficient discriminant information, the classifier refuses to make a decision for test samples falling into that region, thereby guaranteeing the confidence level of the classification results.
An important relation matrix R exists in the FRC classifier; its role is to characterize the fuzzy logical relation between the data structure and the classes. The correctness of this matrix largely determines the validity and robustness of FRC classification. However, FRC constructs R from all samples of the training set, without using the sample points selectively according to the structural features of the input space. When the dataset contains sizable class-overlap regions, the R constructed in this way cannot correctly reflect the true logical relation between the classes and the structure, so FRC suffers from the following defects: the classification lacks robustness; the classification performance degrades; and the computational burden is heavy. The cause of this phenomenon is that the samples falling into class-overlap regions prevent the resulting relation matrix R from correctly reflecting the distribution characteristics of the data.
Summary of the invention
To overcome the above problems, by using the training samples selectively, the present invention proposes an enhanced relational classifier based on representative samples (Enhanced FRC, EFRC). The matrix R in this classifier reflects the logical relation between the data structure and the classes more faithfully, and therefore effectively improves the validity of the classifier.
To achieve the above purpose, the technical scheme adopted by the present invention is:
An enhanced relational classifier based on representative samples, comprising the following steps:
Step 1: apply unsupervised Fuzzy C-means to produce the cluster membership matrix U and the cluster centers V;
Step 2: determine the representative sample set Xnew according to the cluster membership matrix U of all samples. The concrete method is: according to the cluster membership set {u_ji}, hard-partition the training sample set X into c sample subsets C_j; in each subset C_j, sort the samples in descending order of their membership to the j-th cluster; in the sorted subset C_j, select the top λ fraction of samples with the largest cluster memberships to form the representative sample set Xnew, where λ ∈ (0, 1);
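The selection of Step 2 can be sketched in Python; this is a minimal illustrative sketch (the function name, the c × N array layout of U, and the NumPy usage are assumptions, not part of the patent):

```python
import numpy as np

def select_representatives(X, U, lam=0.5):
    """Pick the top lam fraction of samples per cluster by membership.

    X   : (N, d) training samples
    U   : (c, N) fuzzy cluster membership matrix (rows: clusters)
    lam : fraction in (0, 1) of each cluster's samples to keep
    Returns the sorted indices of the representative sample set Xnew.
    """
    c, N = U.shape
    hard = U.argmax(axis=0)              # hard partition: cluster of each sample
    keep = []
    for j in range(c):
        members = np.where(hard == j)[0]
        # sort cluster j's samples by membership u_ji, descending
        order = members[np.argsort(-U[j, members])]
        m = max(1, int(np.ceil(lam * len(order))))
        keep.extend(order[:m])
    return np.sort(np.array(keep))
```

Samples deep inside a cluster (large u_ji) are kept, while samples near the class-overlap regions (small, ambiguous memberships) are discarded.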
Step 3: according to the cluster memberships of the representative sample set Xnew and their class labels, use the φ-composition operator to establish the fuzzy relation matrix R between clusters and classes. The concrete method is: first, use the φ-composition operator to compute the relation matrix R_i corresponding to each sample point in Xnew:

(r_jl)_i = min(1, 1 − u_ji + y_li), l = 1, 2, …, L, j = 1, 2, …, c (1)

where y_li is the membership of the i-th sample to the l-th class; its value equals 1 if the i-th sample belongs to class l, and 0 otherwise. Secondly, the relation matrices R_i corresponding to all samples are aggregated into the final relation matrix R through a fuzzy aggregation operator, where each element is computed by the minimum function:

r_jl = min_i (r_jl)_i (2)
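The construction of R in Step 3 can be sketched as follows; a minimal Python sketch assuming crisp class memberships y_li (1 for the sample's own class, 0 otherwise), with the φ-operator min(1, 1 − u + y) and min-aggregation over samples:

```python
import numpy as np

def relation_matrix(U_new, labels, L):
    """Build the c x L relation matrix R from the representative set.

    U_new  : (c, M) cluster memberships of the representative samples
    labels : (M,) class labels in {0, ..., L-1}
    L      : number of classes
    """
    c, M = U_new.shape
    Y = np.zeros((L, M))
    Y[labels, np.arange(M)] = 1.0          # crisp class membership y_li
    R = np.ones((c, L))
    for i in range(M):
        # phi-composition: (r_jl)_i = min(1, 1 - u_ji + y_li)
        Ri = np.minimum(1.0, 1.0 - U_new[:, [i]] + Y[:, i][None, :])
        R = np.minimum(R, Ri)              # min-aggregation over all samples
    return R
```

A sample with high membership to cluster j but a label other than l drives r_jl down; excluding overlap-region samples therefore keeps the diagonal-like compatibilities high.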
Step 4: compute the memberships μ_j(x) of the test sample x to all clusters from the distances between x and the cluster centers V;

Step 5: use the membership vector μ(x) and the relation matrix R to compute the class membership ω(x) of the test sample x, ω(x) = μ(x) ∘_t R, where ∘_t is the sup-t composition operator; each element of the class membership is computed as ω_l(x) = max_j min(μ_j(x), r_jl);

Step 6: apply the maximum operator to the class memberships ω(x) to obtain the class label of the test sample x, and output the class number as the final result.
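Steps 4–6 can be sketched together; a minimal Python sketch assuming FCM-style memberships with fuzzifier m = 2 and the sup-min composition (t = min), with rejection on tied maxima:

```python
import numpy as np

def classify(x, V, R, m=2.0):
    """Steps 4-6: memberships of x to the c centers, sup-min composition
    with R, then argmax with rejection when the maximum is not unique."""
    d = np.maximum(np.linalg.norm(V - x, axis=1), 1e-12)  # distances to centers
    inv = d ** (-2.0 / (m - 1.0))
    mu = inv / inv.sum()                                  # memberships mu_j(x)
    w = np.max(np.minimum(mu[:, None], R), axis=0)        # omega_l(x), sup-min
    best = np.flatnonzero(np.isclose(w, w.max()))
    return int(best[0]) if len(best) == 1 else None       # None = rejection
```

With V and R from the earlier steps, a test point near a cluster whose row of R strongly favors one class is assigned to that class; ties yield a rejection.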
The method of the present invention first constructs a representative new dataset Xnew from the original training set through purified clustering, seeking to eliminate the class-overlap regions of the original space; it then uses Xnew rather than all samples to construct the matrix R, thereby realizing the classification prediction of test samples. It can be proved by mathematical derivation that, compared with the FRC classifier, the EFRC classifier of the present invention not only keeps the incompatibility relation between clusters and classes unchanged, but also improves the compatibility relation between clusters and classes. Experimental results show that the matrix R in the EFRC classifier of the present invention reflects the logical relation between the data structure and the classes more faithfully, and therefore improves the validity of the classifier well.
Description of drawings
Fig. 1 is the flow chart of the classifier method of the present invention.
Fig. 2 is a schematic diagram of the sample distribution of the dataset in the embodiment of the invention.
Fig. 3 is a table comparing the relation matrices R and the classification accuracies obtained by the method of the invention and by the FRC method.
Embodiment
The present invention first sets the following experimental conditions:
1. Each dimension of the dataset is normalized to the interval [0, 1] by min-max normalization;
2. The cluster number c in EFRC is determined within the range [c_min, c_max], where c_min is the number of classes and c_max is determined from N, the number of samples;
3. The ratio λ of Xnew to the original set X is chosen in the range (0, 1).
On the basis of the above conditions, the enhanced relational classifier based on representative samples proposed by the present invention was implemented on the scientific computing platform Matlab, and the validity of the method was proved by the experimental results in Matlab.
The concrete flow chart of the EFRC classifier of the present invention is shown in Fig. 1. The concrete implementation steps of the present invention are further described below in conjunction with the accompanying drawings:
Step 1: apply unsupervised Fuzzy C-means (FCM) to produce the cluster membership matrix U and the cluster centers V.
Given a training set of N samples X = {x_1, x_2, …, x_N} and the class label set ω = {ω_1, ω_2, …, ω_N}, where x_i ∈ R^d, ω_i ∈ {1, 2, …, L}, and N is the number of samples. U = {u_1, u_2, …, u_N} denotes the cluster membership set of the training samples, where u_ji denotes the membership of the i-th sample to the j-th cluster center; m (1 ≤ m < ∞) is the weighting exponent of u_ji, used to control the fuzziness of the clustering result, and is conventionally set to 2.
The concrete execution of the Fuzzy C-means algorithm is as follows: first, initialize the cluster centers V = [v_1, v_2, …, v_c] and set ε to a very small positive number; then, update the membership matrix U and the cluster centers V of FCM according to the standard update formulas (1) and (2) (the membership update from the distances to the centers, and the center update as the membership-weighted mean); repeat the above update steps until the obtained cluster centers satisfy the condition |V_new − V_old| < ε. Through the above alternating iteration, FCM obtains a locally optimal solution of the objective function.
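The alternating FCM iteration can be sketched as follows; a minimal Python sketch (the initialization handling and the optional V0 parameter are assumptions for illustration):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, V0=None):
    """Alternating FCM updates: memberships U from distances to the centers,
    centers V as membership-weighted means, until |V_new - V_old| < eps."""
    V = np.array(V0, dtype=float) if V0 is not None else X[:c].astype(float)
    for _ in range(max_iter):
        # distances of every sample to every center, floored to avoid division by 0
        d = np.maximum(np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2), 1e-12)
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)        # memberships u_ji; columns sum to 1
        Um = U ** m
        V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)
        if np.linalg.norm(V_new - V) < eps:      # convergence test |V_new - V_old| < eps
            return U, V_new
        V = V_new
    return U, V
```

On well-separated data the centers converge to the cluster means regardless of a reasonable initialization.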
Step 2: determine the representative sample set Xnew according to the cluster memberships of all samples.
After running the clustering algorithm on the training set X = {x_1, x_2, …, x_N}, hard-partition X according to the cluster memberships {u_ji} to form c sample subsets C_j; in each C_j, sort the samples in descending order of their membership to the j-th cluster; in the sorted subset C_j, select the top λ fraction of samples with the largest cluster memberships to form Xnew. Here λ ∈ (0, 1), and its value is the ratio of Xnew to the original set X.
Step 3: according to the cluster memberships of the representative sample set Xnew and their class labels, use the φ-composition operator to establish the fuzzy relation matrix R between clusters and classes.
Each sample point in the set Xnew corresponds to a relation matrix R_i, whose elements are computed by the φ-composition operator:

(r_jl)_i = min(1, 1 − u_ji + y_li), l = 1, 2, …, L, j = 1, 2, …, c (3)

where y_li is the membership of the i-th sample to the l-th class; its value equals 1 if the i-th sample belongs to class l, and 0 otherwise. The R_i corresponding to all samples are aggregated into the final relation matrix R through a fuzzy aggregation operator, where each element is computed by the minimum function:

r_jl = min_i (r_jl)_i (4)

The size of the matrix R is c × L, corresponding to the c cluster centers and the L classes respectively, where r_jl represents the compatibility relation between the j-th cluster and the l-th class. The larger its value, the more compatible the cluster and the class; the smaller its value, the more incompatible they are.
Step 4: complete the classification prediction of the test sample x according to the fuzzy relation matrix R and the cluster centers V.
For a test sample x, the classification process comprises the following steps. First, compute the memberships μ_j(x) of the sample to all clusters from the distances between x and the cluster centers, where μ_j(x) denotes the membership of x to the j-th cluster. Second, use the membership vector μ(x) and the relation matrix R to compute the class membership ω(x) of the sample x. Note that when ω(x) has multiple maximal values, the classifier refuses to make a decision for the sample x; that is, a rejection decision is made for x. A rejection decision means that the training set contains contradictory information, or lacks information, in a particular region.
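The rejection rule can be illustrated with a small numeric sketch (all values below are hypothetical, chosen only to produce a tie): when a test sample sits midway between two clusters whose rows of R point to different classes, the sup-min composition yields tied class memberships and the classifier declines to decide:

```python
import numpy as np

# Hypothetical values: x lies midway between two clusters that favor
# different classes, so the class memberships tie and x is rejected.
mu = np.array([0.5, 0.5])                       # cluster memberships of x
R = np.array([[0.9, 0.1],
              [0.1, 0.9]])                      # relation matrix (clusters x classes)
w = np.max(np.minimum(mu[:, None], R), axis=0)  # sup-min composition: omega(x)
rejected = np.sum(np.isclose(w, w.max())) > 1   # multiple maxima -> rejection
```

Here both class memberships evaluate to 0.5, so no unique maximum exists and the sample is rejected rather than assigned with low confidence.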
Compared with the FRC classifier, the EFRC classifier of the present invention strengthens the robustness of the classification and improves its reliability. In addition, the computational complexity of constructing R in the EFRC classifier decreases from O(NLc) to O(MLc), where L is the number of classes, c is the number of clusters, N is the number of samples of the original dataset X, M is the number of samples of Xnew, and N > M.
Consider an experimental analysis on the dataset shown in Fig. 2. This dataset comprises three clusters and two classes of samples: the samples in clusters C1 and C2 come from class 1, the samples in cluster C3 come from class 2, and the dataset exhibits a certain degree of class overlap. Fig. 3 gives the experimental comparison of the present invention and the FRC method on this dataset. From Fig. 3 it can be seen that, in the R obtained by FRC on this dataset, the element 0.88 in the first row of the matrix is far larger than 0.01; that is, the compatibility relation between cluster C1 and class 1 is stronger than that between C1 and class 2, and this conclusion truly reflects the structural features of the dataset. However, the element 0.15 (0.05) in the second (third) row of the matrix indicates that the compatibility relation between cluster C2 (C3) and class 1 (2) is very weak, so the second (third) row of R does not correctly reflect the relation between cluster and class. Based on such an R, FRC obtains classification accuracies of 100%, 64.4% and 22.0% on the clusters C1, C2 and C3 of the test set, respectively. From this embodiment the following conclusion can be drawn: when the data contain class-overlap regions, the R obtained by FRC using all training samples equally cannot correctly reflect the true distribution of the data, which leads to poor classification performance. The R obtained by EFRC on these data shows that clusters C1 and C2 have a very strong compatibility relation with class 1, and cluster C3 has a strong logical relation with class 2. It can thus be seen that the matrix R in EFRC reflects the logical relation between the data structure and the classes more faithfully, and can therefore effectively improve the classification correctness over FRC.
Contents not described in detail in this application belong to the prior art known to those skilled in the art.
Claims (1)
1. An enhanced relational classifier based on representative samples, characterized by comprising the following steps:
Step 1: apply unsupervised Fuzzy C-means to produce the cluster membership matrix U and the cluster centers V;
Step 2: determine the representative sample set Xnew according to the cluster membership matrix U of all samples; the concrete method is: according to the cluster membership set {u_ji}, hard-partition the training sample set X into c sample subsets C_j; in each subset C_j, sort the samples in descending order of their membership to the j-th cluster; in the sorted subset C_j, select the top λ fraction of samples with the largest cluster memberships to form the representative sample set Xnew, λ ∈ (0, 1);
Step 3: according to the cluster memberships of the representative sample set Xnew and their class labels, use the φ-composition operator to establish the fuzzy relation matrix R between clusters and classes; the concrete method is: first, use the φ-composition operator to compute the relation matrix R_i corresponding to each sample point in Xnew:

(r_jl)_i = min(1, 1 − u_ji + y_li), l = 1, 2, …, L, j = 1, 2, …, c (1)

where y_li is the membership of the i-th sample to the l-th class, whose value equals 1 if the i-th sample belongs to class l and 0 otherwise; secondly, aggregate the relation matrices R_i corresponding to all samples into the final relation matrix R through a fuzzy aggregation operator, where each element is computed by the minimum function r_jl = min_i (r_jl)_i;
Step 4: compute the memberships of the test sample x to all clusters from the distances between x and the cluster centers V;
Step 5: use the membership vector μ(x) and the relation matrix R to compute the class membership ω(x) of the test sample x, ω(x) = μ(x) ∘_t R, where ∘_t is the sup-t composition operator, and each element of the class membership is computed as ω_l(x) = max_j min(μ_j(x), r_jl).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210287636XA CN102799902A (en) | 2012-08-13 | 2012-08-13 | Enhanced relationship classifier based on representative samples |
Publications (1)
Publication Number | Publication Date
---|---
CN102799902A (en) | 2012-11-28
Family
ID=47199001
Country Status (1)
Country | Link
---|---
CN | CN102799902A (en)
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040267770A1 (en) * | 2003-06-25 | 2004-12-30 | Lee Shih-Jong J. | Dynamic learning and knowledge representation for data mining |
CN101025729A (en) * | 2007-03-29 | 2007-08-29 | 复旦大学 | Pattern classification rcognition method based on rough support vector machine |
CN101295362A (en) * | 2007-04-28 | 2008-10-29 | 中国科学院国家天文台 | Combination supporting vector machine and pattern classification method of neighbor method |
- 2012-08-13: application CN201210287636XA filed in CN; published as CN102799902A (en); status: Pending
Non-Patent Citations (1)
Title |
---|
蔡维玲 (Cai Weiling) et al.: "A relational classifier based on optimal supervised clustering centers", 《传感器与微系统》 (Transducer and Microsystem Technologies), vol. 28, no. 4, 30 April 2009, pages 85-87 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250378A (en) * | 2015-06-08 | 2016-12-21 | 腾讯科技(深圳)有限公司 | Public identifier sorting technique and device |
CN106250378B (en) * | 2015-06-08 | 2020-06-02 | 腾讯科技(深圳)有限公司 | Public identification classification method and device |
CN107862344A (en) * | 2017-12-01 | 2018-03-30 | 中南大学 | A kind of image classification method |
CN107862344B (en) * | 2017-12-01 | 2021-06-11 | 中南大学 | Image classification method |
CN110287996A (en) * | 2019-05-27 | 2019-09-27 | 湖州师范学院 | A kind of fuzzy integrated classifier with high interpretation based on collateral learning |
CN110287996B (en) * | 2019-05-27 | 2022-12-09 | 湖州师范学院 | Parallel learning-based fuzzy integration classifier with high interpretability |
CN111444937A (en) * | 2020-01-15 | 2020-07-24 | 湖州师范学院 | Crowdsourcing quality improvement method based on integrated TSK fuzzy classifier |
CN111444937B (en) * | 2020-01-15 | 2023-05-12 | 湖州师范学院 | Crowd-sourced quality improvement method based on integrated TSK fuzzy classifier |
CN113505860A (en) * | 2021-09-07 | 2021-10-15 | 天津所托瑞安汽车科技有限公司 | Screening method and device for blind area detection training set, server and storage medium |
CN113505860B (en) * | 2021-09-07 | 2021-12-31 | 天津所托瑞安汽车科技有限公司 | Screening method and device for blind area detection training set, server and storage medium |
CN115329657A (en) * | 2022-07-06 | 2022-11-11 | 中国石油化工股份有限公司 | Drilling parameter optimization method and device |
CN115329657B (en) * | 2022-07-06 | 2023-06-09 | 中国石油化工股份有限公司 | Drilling parameter optimization method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | C02 | Deemed withdrawal of patent application after publication (patent law 2001) |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20121128