CN105005783B - Method for extracting classification information from high-dimensional asymmetric data - Google Patents

Method for extracting classification information from high-dimensional asymmetric data Download PDF

Info

Publication number
CN105005783B
CN105005783B CN201510251168.4A CN201510251168A CN105005783B
Authority
CN
China
Prior art keywords
matrix
sample
dimension
data
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510251168.4A
Other languages
Chinese (zh)
Other versions
CN105005783A (en)
Inventor
刘丁赟
饶妮妮
刘汉明
郑洁
黎桑
曾伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201510251168.4A priority Critical patent/CN105005783B/en
Publication of CN105005783A publication Critical patent/CN105005783A/en
Application granted granted Critical
Publication of CN105005783B publication Critical patent/CN105005783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The present invention relates to the field of signal and image processing, and provides a method for extracting classification information from high-dimensional asymmetric data, to solve the problems that existing related classification-information extraction methods either are unsuited to sample-asymmetric data, are computationally complex, or easily suffer computational overflow when processing high-dimensional data. The method comprises: obtaining high-dimensional asymmetric data; assigning new weights to Σ_o and Σ_c to form a new covariance matrix Σ_α that replaces Σ_t for eigendecomposition, and solving for its eigenvalues and eigenvectors; combining the eigenvectors into a dimension-reduction matrix, and projecting the high-dimensional asymmetric data through the dimension-reduction matrix to obtain the reduced-dimension classification information. The technical solution proposed by the present invention has low computational complexity, high accuracy, fast running speed, and good stability.

Description

Method for extracting classification information from high-dimensional asymmetric data
Technical field
The present invention relates to the field of signal and image processing, and in particular to a method for extracting classification information from high-dimensional asymmetric data.
Background art
Methods for extracting classification information from two-class sample data have highly important practical application value: for example, using the extracted classification information to distinguish face images from non-face images, diseased samples from non-diseased samples, or useful information from junk information. As the technologies and means for acquiring information become ever more advanced, the dimensionality of the two classes of data to be classified grows ever larger, and the two class sample sizes obtained are usually unbalanced, so that traditional two-class classification methods are severely limited. There is therefore an urgent need for a method that can extract classification information from high-dimensional data whose two class sample sizes are asymmetric, to meet the development needs of every field in today's information society.
Principal component analysis (PCA) is the most commonly used unsupervised multivariate statistical analysis method. It performs an eigenanalysis of the covariance matrix of a data set and isolates the main components of the data, as classification information, under the condition of minimizing reconstruction error. PCA has strong data-simplification capability and is easy to implement. However, when PCA faces unbalanced samples, although it maximizes the information reconstructed in the principal-component space, it cannot effectively retain the information that benefits classification, which degrades the classification performance of the whole application system. The main cause of PCA's misclassification is that, when the sample size of one class of data (called the positive class) is smaller than that of the other class (called the negative class), the eigenvectors corresponding to the small eigenvalues of the positive-class conditional covariance matrix deviate severely. To remedy this defect of PCA, an asymmetric PCA (Asymmetric Principal Component Analysis, APCA) method was proposed. APCA focuses on eliminating the factor that interferes with correct classification by PCA: it assigns new weights to the positive-class and negative-class conditional covariance matrices, forming a new covariance matrix that replaces PCA's total scatter matrix before eigendecomposition. Compared with the PCA method, APCA's ability to extract classification information from unbalanced data is greatly improved, but when processing high-dimensional data (such as some medical images) it frequently suffers from computational overflow. The reason is that the new covariance matrix constructed by APCA is a linear combination of several square matrices of size n × n, where n is the original dimensionality of the data. In many practical applications the original dimensionality is large; for example, an image of size 200 × 200 has 40000 pixels, i.e., 40000 dimensions. When computing the eigenvalues of such a high-dimensional covariance matrix, APCA therefore easily exhausts the calculator's memory so that subsequent computation cannot continue; even when the computation is possible, such a huge matrix dimension inevitably brings high computational complexity, and both computation time and error grow substantially.
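For a sense of scale, the following back-of-envelope sketch (illustrative only, not part of the original disclosure) estimates the memory that a single dense n × n covariance matrix requires at the 40000-dimension example above:

```python
# Memory needed to hold one dense n x n covariance matrix in float64
# for a 200 x 200 image (n = 40000 dimensions).
n = 200 * 200                               # 40000
bytes_per_entry = 8                         # float64
gib = n * n * bytes_per_entry / 2**30
print(f"{gib:.1f} GiB per n x n matrix")    # ~11.9 GiB; APCA combines several such matrices
```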
It can be seen that existing related classification-information extraction methods either are unsuited to sample-asymmetric data, or are computationally complex and easily suffer computational overflow when processing high-dimensional data.
Summary of the invention
[technical problems to be solved]
The purpose of the present invention is to solve the above drawbacks in the background art. By introducing joint-diagonalization theory, a method for extracting classification information from high-dimensional data with asymmetric two-class sample sizes is devised and embodied. For ease of description, the method provided by the invention for extracting classification information from high-dimensional asymmetric data is named Joint Diagonalization Principal Component Analysis (JDPCA).
[technical solution]
The present invention is achieved by the following technical solutions.
The present invention relates to a method for extracting classification information from high-dimensional asymmetric data, comprising the following steps:
Step A: obtain high-dimensional asymmetric data composed of positive samples and negative samples; by analysis, obtain the dimensionality n of the high-dimensional asymmetric data, the total sample count q of the high-dimensional asymmetric data, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; set the dimension m of the classification information to be extracted.
Step B: compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples; center the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c.
Step C: construct the matrices X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), where α_o = q_c/q and α_c = q_o/q.
Step D: compute the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {u_i^o} of X_o^T X_o, the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {u_j^c} of X_c^T X_c, the eigenvalue λ_mo and corresponding eigenvector u_mo of X_mo^T X_mo, and the eigenvalue λ_mc and corresponding eigenvector u_mc of X_mc^T X_mc.
Step E: piece together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors and eigenvalues computed in step D, and construct the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2}, where Λ^{1/2} denotes the entrywise square root of the diagonal matrix Λ.
Step F: compute the eigenvalues and corresponding eigenvectors of Σ̂_α; arrange the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {u^(k)}, where k = 1, 2, ... q; combine the eigenvectors corresponding to the first m eigenvalues in {u^(k)} into the dimension-reduction matrix Φ_m, and project the high-dimensional asymmetric data through Φ_m to obtain the reduced-dimension classification information.
As a preferred embodiment, step D specifically includes:
Step D1: compute the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {v_i^o} of X_o X_o^T, where i = 1, 2, ... q_o − 1; compute the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {v_j^c} of X_c X_c^T, where j = 1, 2, ... q_c − 1; compute the eigenvalue λ_mo and corresponding eigenvector v_mo of X_mo X_mo^T; compute the eigenvalue λ_mc and corresponding eigenvector v_mc of X_mc X_mc^T.
Step D2: compute the eigenvectors u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
As another preferred embodiment, the method in step F for computing the eigenvalues {λ^(k)} and corresponding eigenvectors {u^(k)} of Σ̂_α is:
compute the eigenvalues and corresponding eigenvectors of Σ̂_α, and arrange the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {v^(k)}, where k = 1, 2, ... q;
according to u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)), compute the eigenvector {u^(k)} of Σ_α corresponding to each eigenvalue {λ^(k)}.
As another preferred embodiment, the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
As another preferred embodiment, the mean vector of the high-dimensional asymmetric sample data is M = (1/q)(Σ_{i=1}^{q_o} x_i^o + Σ_{j=1}^{q_c} x_j^c), the mean vector of the positive-class samples is M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o, and the mean vector of the negative-class samples is M_c = (1/q_c) Σ_{j=1}^{q_c} x_j^c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
As another preferred embodiment, the dimensionality n of the high-dimensional asymmetric sample data and its total sample count q satisfy: n ≥ 3q.
As another preferred embodiment, the high-dimensional asymmetric sample data is image data, gene expression data, or genome-wide association study data.
As another preferred embodiment, each data element in the high-dimensional asymmetric data is a real number.
The technical solution of the present invention is described in detail below.
The present invention is directed at high-dimensional asymmetric data composed of positive samples and negative samples. Specifically, by analyzing the acquired data, the dimensionality n of the high-dimensional asymmetric data, its total sample count q, the sample count q_o of the positive samples, and the sample count q_c of the negative samples are obtained, with q = q_o + q_c. Because the data are asymmetric, q_o ≠ q_c; because the data are high-dimensional, n >> q, where ">>" denotes "much larger than". In general, the dimensionality n of the high-dimensional asymmetric data should be at least 3 times its total sample count q, i.e., n ≥ 3q.
Specifically, the positive sample set χ_o in the high-dimensional asymmetric data consists of q_o samples, each represented as a row vector, where the subscript o denotes the positive class and x_i^o denotes the i-th positive sample, i = 1, 2, ... q_o. The mean vector M_o of χ_o is:
M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o (1)
and the class conditional covariance matrix Σ_o is:
Σ_o = (1/q_o) Σ_{i=1}^{q_o} (x_i^o − M_o)^T (x_i^o − M_o) (2)
The negative sample set χ_c in the high-dimensional asymmetric data consists of q_c samples, each represented as a row vector, where the subscript c denotes the negative class and x_j^c denotes the j-th negative sample, j = 1, 2, ... q_c. The mean vector M_c and the class conditional covariance matrix Σ_c of χ_c are obtained in the same way.
The high-dimensional asymmetric data is the union of the two sample sets, χ = χ_o ∪ χ_c. Its mean vector M is obtained in the same way, and the centered high-dimensional asymmetric data X is obtained from the mean vector M; specifically, each row of X is the corresponding sample of χ minus M.
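For illustration, the quantities defined above can be sketched in NumPy as follows; the sample counts and synthetic data are invented for the example, and samples are stored as row vectors as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
q_o, q_c, n = 30, 70, 600                      # invented sizes, q = 100 << n
chi_o = rng.normal(size=(q_o, n))              # positive sample set, q_o rows
chi_c = rng.normal(0.3, 1.0, size=(q_c, n))    # negative sample set, q_c rows

M_o = chi_o.mean(axis=0)                       # positive-class mean vector M_o
M_c = chi_c.mean(axis=0)                       # negative-class mean vector M_c
chi = np.vstack([chi_o, chi_c])                # union of the two classes
M = chi.mean(axis=0)                           # overall mean vector M
X = chi - M                                    # centered high-dimensional data
# Class conditional covariance (n x n; formed here only because n is small):
Sigma_o = (chi_o - M_o).T @ (chi_o - M_o) / q_o
```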
PCA in the prior art uses the smaller matrix X X^T to perform the eigendecomposition of the total scatter matrix Σ_t indirectly; the eigenvectors corresponding to the m largest eigenvalues form the dimension-reduction matrix, and any n-dimensional data point projected through that matrix is reduced to m dimensions. The total scatter matrix Σ_t is given by formula (3) and the between-class scatter matrix Σ_m by formula (4):
Σ_t = (1/q) X^T X (3)
Σ_m = (q_o/q)(M_o − M)^T (M_o − M) + (q_c/q)(M_c − M)^T (M_c − M) (4)
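The q × q shortcut described above can be sketched as follows (an illustrative rendering, not code from the patent): the eigendecomposition is done on X X^T instead of the n × n total scatter matrix, and the eigenvectors are lifted back to n dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
q, n, m = 100, 5000, 10
X = rng.normal(size=(q, n))
X -= X.mean(axis=0)                         # center the data

G = X @ X.T                                 # q x q Gram matrix, cheap compared with n x n
lam, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]              # sort descending
keep = lam > 1e-10                          # drop the zero eigenvalue caused by centering
U = X.T @ V[:, keep] / np.sqrt(lam[keep])   # unit eigenvectors of X^T X, n x (q-1)
Phi_m = U[:, :m]                            # dimension-reduction matrix, n x m
Y = X @ Phi_m                               # data reduced from n to m dimensions
```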
The JDPCA method provided by the invention assigns new weights to Σ_o and Σ_c, forming a new covariance matrix Σ_α that replaces Σ_t for eigendecomposition, and solves for its eigenvalues and eigenvectors. The covariance matrix Σ_α is given by formula (5):
Σ_α = α_o Σ_o + α_c Σ_c + Σ_m (5)
Because the weights of the two class conditional covariance matrices in Σ_α become α_o = q_c/q and α_c = q_o/q, they are no longer the estimated prior probabilities of the two classes, so a matrix X' satisfying Σ_α = (1/q) X'^T X' cannot be obtained directly from the centered high-dimensional asymmetric data X as in PCA. To eigendecompose Σ_α nevertheless, the present invention finds a matrix U that can simultaneously diagonalize all the matrices composing Σ_α; after the diagonalization is achieved, the matrix Σ̂_α is constructed from U and the resulting diagonal matrix. Since the entire calculation process generates no matrix as large as n × n, the computational complexity of JDPCA is substantially reduced. However, existing joint-diagonalization methods are approximate algorithms that usually require iteration or inversion; if JDPCA adopted them directly, not only would the extracted information be distorted, but the computational load would also grow. The present invention therefore skillfully exploits the low-rank and real-symmetric properties of the above covariance matrices and designs a fast and accurate new non-orthogonal joint-diagonalization algorithm to find the matrix U, so that JDPCA does not run into the curse of dimensionality when processing high-dimensional data.
The conventional joint-diagonalization problem can be described in the following form: for L matrices A_1, A_2, ... A_L of size n × n, find one diagonalizing matrix U and L corresponding diagonal matrices Λ_1, Λ_2, ... Λ_L such that A_l = U Λ_l U^H holds for every l ∈ {1, 2, 3, ... L}. Since the matrices to be jointly diagonalized in the present invention are real symmetric, the conjugate transpose "H" is written throughout as the transpose "T". According to formulas (3), (4), and (5), let Σ_mo = α_c (M_o − M)^T (M_o − M) and Σ_mc = α_o (M_c − M)^T (M_c − M); then Σ_α can be expressed as:
Σ_α = α_o Σ_o + α_c Σ_c + Σ_mo + Σ_mc (6)
It is an object of the present invention to find one matrix U and four diagonal matrices Λ_o, Λ_c, Λ_mo, Λ_mc such that U simultaneously diagonalizes the four square matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc into the corresponding diagonal matrices, so that Σ_α can be decomposed into the following form:
Σ_α = U Λ_o U^T + U Λ_c U^T + U Λ_mo U^T + U Λ_mc U^T = U (Λ_o + Λ_c + Λ_mo + Λ_mc) U^T (7)
The matrix U (Λ_o + Λ_c + Λ_mo + Λ_mc) U^T in formula (7) is then exactly the decomposition of Σ_α sought by the invention, from which the matrix Σ̂_α is built. The four diagonal matrices can be constructed from the eigenvalues of the corresponding original square matrices; the difficulty lies in computing the diagonalizing matrix U. The matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc are all real symmetric of size n × n, and when q << n their ranks are q_o − 1, q_c − 1, 1, and 1 respectively, much smaller than the dimension n. For a general real symmetric matrix A_l with diagonalization A_l = U_l Λ_l U_l^T, this transformation has the following properties:
(a) Changing the columns of the diagonalizing matrix U_l that correspond to zero eigenvalues in Λ_l leaves the equation valid.
(b) Simultaneously exchanging a pair of eigenvalues in Λ_l and the positions of their corresponding eigenvectors in U_l leaves the equation valid.
(c) Directly deleting zero eigenvalues in Λ_l together with the corresponding eigenvector columns in U_l leaves the equation valid.
(d) Appending zero eigenvalues after the original eigenvalues in Λ_l and adding zero vectors at the corresponding positions in U_l leaves the equation valid.
Using the above properties together with the singular value decomposition theorem, all eigenvalues and eigenvectors of α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc can be found; exploiting their low-rank property, the positions of these eigenvalues and eigenvectors are then rearranged, and the eigenvector columns of one matrix that correspond to part of its zero eigenvalues are replaced by the eigenvectors corresponding to the nonzero eigenvalues of the other three matrices, so that a matrix U satisfying the condition can finally be pieced together. After the matrix Σ̂_α is constructed from U and the diagonalized diagonal matrix, the dimension-reduction matrix Φ_m is built from it, and the high-dimensional asymmetric data is projected through Φ_m to obtain the reduced-dimension classification information.
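A tiny numeric check of properties (a) and (d), illustrative rather than taken from the patent, shows why the columns of U paired with zero eigenvalues can be chosen freely:

```python
import numpy as np

rng = np.random.default_rng(2)
u1 = rng.normal(size=(5, 1))
u1 /= np.linalg.norm(u1)
A = 2.0 * (u1 @ u1.T)                  # rank-1 real symmetric matrix, eigenvalue 2

pad = rng.normal(size=(5, 3))          # arbitrary columns paired with zero eigenvalues
U = np.hstack([u1, pad])               # 5 x 4
Lam = np.diag([2.0, 0.0, 0.0, 0.0])    # 4 x 4
print(np.allclose(U @ Lam @ U.T, A))   # True, whatever 'pad' contains
```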
[beneficial effect]
Compared with the prior art, the technical solution proposed by the present invention has the following advantages:
(1) When reducing the dimension of two classes of samples with unbalanced sizes, the redundant information of the minority class is usually less stable than that of the majority class; if this redundancy is not rejected thoroughly enough, severe over-fitting occurs during classification. The majority class, having more training samples, has more reliable and stable redundant information, part of which contains credible between-class difference information. The present invention therefore increases the strength with which the unstable redundant information of the minority class is rejected and decreases the strength with which the redundant information of the majority class is rejected. The principal components retained by the invention are those that best embody the difference between the two classes. Consequently, for two classes of sample data with unbalanced sizes, each principal component extracted by the invention distinguishes the two classes more clearly than traditional PCA, and these principal components remain orthogonal and mutually uncorrelated.
(2) Because the entire calculation process of the invention neither generates nor operates on any matrix as large as n × n (where n is the original dimensionality of the data), the computational complexity of the invention is greatly reduced. When processing high-dimensional two-class sample data, the invention computes with high accuracy, runs fast, and is stable.
Description of the drawings
Fig. 1 is a flowchart of the method for extracting classification information from high-dimensional asymmetric data provided by an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described below clearly and completely with reference to the accompanying drawing. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them, and do not limit the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the invention.
Fig. 1 is a flowchart of the method for extracting classification information from high-dimensional asymmetric data provided by embodiment one of the present invention. As shown in Fig. 1, the method comprises steps S11 to S19, each described in detail below.
Step S11: higher-dimension asymmetric data is obtained.
Specifically, the high-dimensional asymmetric data is composed of positive samples and negative samples. Analysis yields the dimensionality n of the high-dimensional asymmetric data, its total sample count q, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; the dimension m of the classification information to be extracted is set. In this embodiment, x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
Step S12: compute the mean vectors, and compute the centered positive sample set matrix and negative sample set matrix.
Specifically, compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples, and center the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c. The computation of each mean vector is described in the summary of the invention. In this embodiment, the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c.
Step S13: construct the matrices X_o, X_c, X_mo, and X_mc respectively.
Specifically, X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), with α_o = q_c/q and α_c = q_o/q. The matrix X_o has size q_o × n and obviously satisfies X_o^T X_o = α_o Σ_o; similarly, the matrix X_c satisfies X_c^T X_c = α_c Σ_c. The matrix X_mo has size 1 × n and obviously satisfies X_mo^T X_mo = Σ_mo; similarly, the matrix X_mc satisfies X_mc^T X_mc = Σ_mc.
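For illustration, the following NumPy sketch starts a running example of steps S11 to S13; all sizes and data are invented for the example, and the snippets after steps S15, S16, S18, and S19 continue from it:

```python
import numpy as np

# Steps S11-S12: synthetic asymmetric data, samples as rows.
rng = np.random.default_rng(3)
q_o, q_c, n, m = 20, 80, 1000, 10              # asymmetric classes, n >= 3q
q = q_o + q_c
chi_o = rng.normal(size=(q_o, n))              # positive samples
chi_c = rng.normal(0.2, 1.0, size=(q_c, n))    # negative samples

M_o, M_c = chi_o.mean(axis=0), chi_c.mean(axis=0)
M = np.vstack([chi_o, chi_c]).mean(axis=0)
S_o, S_c = chi_o - M_o, chi_c - M_c            # centered class set matrices

# Step S13: the four matrices, with the swapped-prior weights.
alpha_o, alpha_c = q_c / q, q_o / q
X_o = np.sqrt(alpha_o / q_o) * S_o             # X_o^T X_o = alpha_o * Sigma_o
X_c = np.sqrt(alpha_c / q_c) * S_c             # X_c^T X_c = alpha_c * Sigma_c
X_mo = np.sqrt(alpha_c) * (M_o - M)[None, :]   # 1 x n, X_mo^T X_mo = Sigma_mo
X_mc = np.sqrt(alpha_o) * (M_c - M)[None, :]   # 1 x n, X_mc^T X_mc = Sigma_mc
```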
Step S14: compute the nonzero eigenvalues and corresponding eigenvectors of the matrices X_o X_o^T, X_c X_c^T, X_mo X_mo^T, and X_mc X_mc^T respectively.
Specifically, the matrix X_o X_o^T has size q_o × q_o; compute its nonzero eigenvalues {λ_i^o} and the corresponding eigenvectors {v_i^o}, where λ_i^o denotes the i-th nonzero eigenvalue of X_o X_o^T, v_i^o denotes the corresponding eigenvector, and i = 1, 2, ... q_o − 1. The matrix X_c X_c^T has size q_c × q_c; compute its nonzero eigenvalues {λ_j^c} and the corresponding eigenvectors {v_j^c}, where j = 1, 2, ... q_c − 1. The matrix X_mo X_mo^T has size 1 × 1, so its eigenvalue is itself, denoted λ_mo, with the corresponding eigenvector denoted v_mo. The matrix X_mc X_mc^T likewise has size 1 × 1, with eigenvalue λ_mc and corresponding eigenvector v_mc.
Step S15: solve for the nonzero eigenvalues and corresponding eigenvectors of X_o^T X_o, X_c^T X_c, X_mo^T X_mo, and X_mc^T X_mc.
Step S15 uses the singular value decomposition theorem: X_o^T X_o and X_o X_o^T share the same nonzero eigenvalues, and their eigenvectors stand in a fixed correspondence, which allows the eigenvectors of X_o^T X_o to be solved; the nonzero eigenvalues and corresponding eigenvectors of X_c^T X_c, X_mo^T X_mo, and X_mc^T X_mc are solved in the same way. Specifically, the eigenvectors are computed as u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
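Continuing the running sketch, steps S14 and S15 reduce to eigendecompositions of small matrices followed by the lift above; the helper name small_eig is invented for the example:

```python
# ...continues the running sketch started at step S13.
def small_eig(X, r):
    """Nonzero eigenpairs of X @ X.T (small), lifted to X.T @ X via the SVD relation."""
    lam, V = np.linalg.eigh(X @ X.T)             # ascending order
    lam, V = lam[::-1][:r], V[:, ::-1][:, :r]    # keep the r nonzero eigenvalues
    U = (X.T @ V) / np.sqrt(lam)                 # u_i = X^T v_i / sqrt(lam_i), unit columns
    return lam, U

lam_o, U_o = small_eig(X_o, q_o - 1)   # rank q_o - 1: centering removes one eigenvalue
lam_c, U_c = small_eig(X_c, q_c - 1)
lam_mo, U_mo = small_eig(X_mo, 1)      # 1 x 1 case: the eigenvalue is X_mo @ X_mo.T itself
lam_mc, U_mc = small_eig(X_mc, 1)
```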
Step S16: piece together the diagonalizing matrix U and the diagonal matrix Λ.
In step S16, four diagonalizing matrices of size n × q and four diagonal matrices of size q × q are first constructed: U_o places the eigenvectors u_1^o, ..., u_{q_o−1}^o in its first q_o − 1 columns and zero vectors in the remaining columns, with Λ_o = diag(λ_1^o, ..., λ_{q_o−1}^o, 0, ..., 0); U_c places u_1^c, ..., u_{q_c−1}^c in the next q_c − 1 columns, with the matching diagonal matrix Λ_c; U_mo and U_mc place u_mo and u_mc in the last two columns, with the matching Λ_mo and Λ_mc.
According to properties (b), (c), and (d) of the diagonalization algorithm, it is easy to verify that these satisfy α_o Σ_o = U_o Λ_o U_o^T, α_c Σ_c = U_c Λ_c U_c^T, Σ_mo = U_mo Λ_mo U_mo^T, and Σ_mc = U_mc Λ_mc U_mc^T.
The diagonalizing matrix U and the diagonal matrix Λ are then obtained by combination: U = U_o + U_c + U_mo + U_mc, i.e., U = [u_1^o, ..., u_{q_o−1}^o, u_1^c, ..., u_{q_c−1}^c, u_mo, u_mc], and Λ = Λ_o + Λ_c + Λ_mo + Λ_mc = diag(λ_1^o, ..., λ_{q_o−1}^o, λ_1^c, ..., λ_{q_c−1}^c, λ_mo, λ_mc).
According to property (a) of the diagonalization algorithm, it can be verified that U jointly diagonalizes the four square matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc that compose Σ_α, and the diagonalization results are exactly Λ_o, Λ_c, Λ_mo, and Λ_mc.
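Continuing the running sketch, step S16 is pure bookkeeping, and the joint-diagonalization claim can be checked numerically here because the example's n is modest:

```python
# ...continues the running sketch (step S16).
U = np.hstack([U_o, U_c, U_mo, U_mc])                  # (q_o-1)+(q_c-1)+1+1 = q columns
lam_all = np.concatenate([lam_o, lam_c, lam_mo, lam_mc])
Lam = np.diag(lam_all)

# Optional check: U @ Lam @ U.T reproduces Sigma_alpha without JDPCA ever forming it.
Sigma_alpha = X_o.T @ X_o + X_c.T @ X_c + X_mo.T @ X_mo + X_mc.T @ X_mc
print(np.allclose(U @ Lam @ U.T, Sigma_alpha))         # True
```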
Step S17: construct the matrix Σ̂_α and compute its eigenvalues and corresponding eigenvectors.
Construct the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2}. From formula (7) it follows that the nonzero eigenvalues of Σ̂_α are exactly those of Σ_α = U Λ U^T. Compute the eigenvalues {λ^(k)} and the corresponding eigenvectors {v^(k)} of Σ̂_α, with the eigenvalues {λ^(k)} arranged from largest to smallest, where k = 1, 2, ... q.
Step S18: compute the eigenvalues and corresponding eigenvectors of Σ_α.
Specifically, according to the singular value decomposition theorem, the eigenvector of Σ_α corresponding to each eigenvalue {λ^(k)} is computed by the formula u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)).
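Continuing the running sketch, steps S17 and S18 operate only on q × q matrices; the symmetric form Λ^{1/2} U^T U Λ^{1/2} for Σ̂_α is the one assumed above:

```python
# ...continues the running sketch (steps S17-S18).
A = U * np.sqrt(lam_all)                  # A = U @ Lam^{1/2}, n x q
Sigma_hat = A.T @ A                       # Lam^{1/2} U^T U Lam^{1/2}, q x q
lam_k, V_k = np.linalg.eigh(Sigma_hat)    # shares the nonzero eigenvalues of Sigma_alpha
lam_k, V_k = lam_k[::-1], V_k[:, ::-1]    # arrange from largest to smallest
U_k = (A @ V_k) / np.sqrt(lam_k)          # u(k) = U Lam^{1/2} v(k) / sqrt(lam(k))
```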
Step S19: combine to obtain the dimension-reduction matrix, and project the high-dimensional asymmetric data through it to obtain the reduced-dimension classification information.
Specifically, the eigenvectors corresponding to the first m eigenvalues in {u^(k)} are combined into the dimension-reduction matrix Φ_m, and the high-dimensional asymmetric data obtained in step S11 is projected through Φ_m to obtain the reduced-dimension classification information.
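Continuing the running sketch, step S19 completes the example:

```python
# ...continues the running sketch (step S19).
Phi_m = U_k[:, :m]                        # dimension-reduction matrix, n x m
Y = np.vstack([chi_o, chi_c]) @ Phi_m     # classification information, q x m
print(Y.shape)                            # (100, 10)
```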
Classification-information extraction experiments on two-class sample data using the method provided by the embodiment of the present invention are described below. To verify and compare the performance of the embodiment from the two aspects of different dimension scales and different two-class sample proportions, two groups of data were used, referred to as group A and group B. Group A verifies the accuracy and running speed of the embodiment on data of different dimensionality; group B verifies the class-resolution capability of the embodiment on unbalanced samples. Each group contains positive and negative samples, and the experimental data are described as follows.
Group A data: generated to verify the embodiment's performance in extracting classification information from data of different dimensionality. In group A, the positive and negative sample counts are both set to 500, so the total sample count is 1000. The mean of every dimension of the positive samples is 0, and the variance of the i-th dimension is 1/i^0.5. The means of the negative samples are nonzero and differ across dimensions: the mean of the j-th dimension is 1/(8j)^0.25 and its variance is 1/(50j)^0.25. The two classes were designed this way for the following reasons: (1) the means and variances of the two classes both differ, guaranteeing that the differences are comprehensive; (2) dimensions where the class means differ greatly also differ greatly in variance (concentrated mainly in the first 20 dimensions), and dimensions with small mean differences have small variance differences, i.e., the mean and variance differences show no overly obvious separation trend over the whole dimension range, so that every dimension contributes to correct classification and classification accuracy also grows as the total dimensionality grows; (3) on any single dimension the mean and variance differences of the two classes are small, so relying on one or a few dimensions cannot separate the two classes; each method can thus attain a certain accuracy when discriminating the two classes, but cannot easily reach 100%. The total dimensionality n of group A rises from n = 1500 to n = 10000 in steps of 500, yielding multiple data sets with identical properties but different dimensionality.
Group B data: face and non-face images downloaded from the MIT face image database, of which 1000 were selected for the experiment. In the experiment, the total sample count (1000) is fixed, and the proportion of face-image (positive) samples participating in training is changed from 50% to 5%, i.e., gradually reduced from 450 to 45. The two class sample counts thus change from a balanced to an unbalanced state, yielding multiple data sets with identical dimensionality but different class sample counts.
After extracting classification information from the above data with the embodiment of the present invention, an improved support vector machine (ODR-BSMOTE-SVM, abbreviated OB-SVM) classifies the samples according to the extracted classification information. In the OB-SVM classifier, the kernel function is fixed as the Gaussian kernel, and the balance parameter of ODR and BSMOTE is taken as α = 0.9. All experiments are verified by ten-fold cross-validation, and the classification results are assessed by average sensitivity (Sen), specificity (Spe), and accuracy (Acc). Let FP be the number of negative samples misclassified as positive and FN the number of positive samples misclassified as negative; TP and TN denote the numbers of correctly classified positive and negative samples respectively. Sensitivity, specificity, and accuracy are then defined as follows.
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
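For clarity, the three figures of merit can be computed directly; the counts below are invented for the example:

```python
def sen_spe_acc(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

print(sen_spe_acc(tp=40, tn=45, fp=5, fn=10))   # (0.8, 0.9, 0.85)
```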
Experiment one: performance and running-speed verification at different dimensionalities.
Experiment one uses the group A data. Because of ten-fold validation, the total number of training samples each time is q = 900 (450 each of positive and negative samples), and there are 100 test samples (50 of each class). The total dimensionality rises from n = 1500 to n = 10000 in steps of 500. On the group A data of each dimensionality, JDPCA, APCA, and PCA are executed respectively, followed by OB-SVM classification. The dimension-reduction parameter m is fixed at 50; note that the dimension-reduction parameter m is the dimension m of the classification information to be extracted. The average classification performance and computation time obtained by ten-fold cross-validation at each dimensionality are shown in Table 1. Since q_o = q_c = 450, the two class sample sizes are balanced, so the covariance matrix Σ_α obtained by JDPCA and APCA does not differ from the total scatter matrix Σ_t obtained by PCA in its eigendecomposition result, and the average sensitivity, specificity, accuracy, and AUC of the three methods are identical. Table 1 therefore lists these performance values only once. The last three columns of Table 1 give the average training times the three methods require on group A data of each dimensionality; OM denotes out of memory (Out of Memory, OM), i.e., the computation cannot continue and the method is forced to abort.
Table 1. Classification performance and computation time on data of different dimensionalities, based on JDPCA, APCA, and PCA classification information
As Table 1 shows, when the total dimensionality n rises from 1500 to 10000, the accuracies of the three methods remain identical at each dimensionality, slowly rising from 91.7% to 100% as the dimensionality increases. The running times of the three methods, however, differ greatly. The running times of PCA and JDPCA grow linearly: for every 500 added dimensions, PCA's running time grows by about 1.2 s on average and JDPCA's by about 5 s. APCA's running time grows quadratically, and memory overflow already occurs when n reaches 9500. Because the embodiment of the invention involves more intermediate variables and corresponding operations, JDPCA shows no running-time advantage over APCA at low dimensionality, but once the dimensionality exceeds 2500 its speed advantage becomes evident. Facing high-dimensional data, APCA's curse-of-dimensionality problem becomes severe and its computational complexity rises steeply, so that when the data dimensionality reaches 9500 memory overflows and the computation fails, whereas the JDPCA proposed by the invention has no such problem. This is because the joint-diagonalization algorithm designed in the embodiment avoids generating and computing large n × n matrices (where n is the original dimensionality of the data) and involves no complex operations such as inversion or iteration, so the eigenvalues and eigenvectors of Σ_α are computed more quickly and accurately. For larger data, APCA's computational complexity grows ever greater, making its running speed too slow or its operation outright impossible, which cannot meet the current need of every field to process big data. Although the invention is not as fast as PCA, its running time is greatly reduced compared with APCA.
Experiment two: class-resolution verification on unbalanced data.
Experiment two uses the group B data. To eliminate the influence of computational-complexity differences across dimensionalities on this experiment, all images are standardized to 45 × 45 (n = 2025). The total number of training samples is q = 900. The face-image sample proportion (P) is changed from 50% to 5%, i.e., gradually reduced from 450 to 45. The dimension-reduction parameter m is fixed at 50. The ten-fold cross-validation results are shown in Table 2.
Table 2. Classification performance at different face-image proportions, based on JDPCA, APCA, and PCA classification information
As Table 2 shows, on the unbalanced MIT face data the specificities and accuracies of the three methods JDPCA, APCA, and PCA differ little, and JDPCA and APCA give no different final results because the theoretical values of their computations are identical. The difference among the three lies mainly in sensitivity. Clearly, as the face sample proportion gradually decreases, PCA's sensitivity drops substantially, and its drop is larger than that of JDPCA and APCA. When face samples account for only 5% of the population, PCA's sensitivity is up to 6% lower than that of JDPCA and APCA. It follows that the JDPCA provided by the embodiment of the invention retains APCA's class-resolution advantage on unbalanced data sets.
Therefore, the JDPCA proposed by the present invention combines fast running speed with high accuracy when extracting classification information from high-dimensional data with unequal sample sizes, and has extensive practical application value in fields such as communications, radar, and biomedical signal/image processing.
As the above embodiments and related experiments show, the embodiment of the present invention increases the strength with which the unstable redundant information of the minority class is rejected and decreases the strength with which the redundant information of the majority class is rejected; in addition, the principal components retained by the embodiment are those that best embody the difference between the two classes. Therefore, for two classes of sample data with unbalanced sizes, each principal component extracted by the embodiment distinguishes the two classes more clearly than traditional PCA, and these principal components remain orthogonal and mutually uncorrelated. Moreover, because the entire calculation process of the embodiment neither generates nor operates on any matrix as large as n × n (where n is the original dimensionality of the data), the computational complexity of the embodiment is greatly reduced. When processing high-dimensional two-class sample data, the embodiment computes with high accuracy, runs fast, and is stable, whereas the traditional APCA method has larger computation error and is prone to computational overflow.

Claims (8)

1. An image classification method applied to the situation of asymmetric samples, characterized by comprising the following steps:
inputting images to be classified, the images to be classified comprising asymmetric positive samples and negative samples, and extracting classification information from the positive samples and negative samples according to a joint diagonalization principal component analysis method, comprising:
Step A: obtaining high-dimensional asymmetric data composed of the positive samples and the negative samples, and analyzing to obtain the dimensionality n of the high-dimensional asymmetric data, the total sample count q of the high-dimensional asymmetric data, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; setting the dimension m of the classification information to be extracted;
Step B: computing the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples, and centering the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c;
Step C: constructing the matrices X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), where α_o = q_c/q and α_c = q_o/q; Step D: computing the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {u_i^o} of X_o^T X_o, the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {u_j^c} of X_c^T X_c, the eigenvalue λ_mo and corresponding eigenvector u_mo of X_mo^T X_mo, and the eigenvalue λ_mc and corresponding eigenvector u_mc of X_mc^T X_mc;
Step E: piecing together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors obtained in step D, and constructing the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2};
Step F: computing the eigenvalues and corresponding eigenvectors of Σ̂_α, arranging the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {u^(k)}, where k = 1, 2, ... q, combining the eigenvectors corresponding to the first m eigenvalues in {u^(k)} into the dimension-reduction matrix Φ_m, and projecting the high-dimensional asymmetric data through Φ_m to obtain the reduced-dimension classification information;
classifying the images to be classified according to the classification information;
wherein step D specifically comprises:
Step D1: computing the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {v_i^o} of X_o X_o^T, where i = 1, 2, ... q_o − 1; computing the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {v_j^c} of X_c X_c^T, where j = 1, 2, ... q_c − 1; computing the eigenvalue λ_mo and corresponding eigenvector v_mo of X_mo X_mo^T; and computing the eigenvalue λ_mc and corresponding eigenvector v_mc of X_mc X_mc^T;
Step D2: computing the eigenvectors u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
2. The image classification method according to claim 1, characterized in that:
the images to be classified are face and non-face images, or diseased samples and non-diseased samples.
3. The image classification method according to claim 1, characterized in that the method in step F for computing the eigenvalues {λ^(k)} and corresponding eigenvectors {u^(k)} of Σ̂_α is:
computing the eigenvalues and corresponding eigenvectors of Σ̂_α, and arranging the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {v^(k)}, where k = 1, 2, ... q;
computing, according to u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)), the eigenvector {u^(k)} of Σ_α corresponding to each eigenvalue {λ^(k)}.
4. The image classification method according to claim 1, 2 or 3, characterized in that: the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
5. The image classification method according to claim 1, 2 or 3, characterized in that: the mean vector of the high-dimensional asymmetric sample data is M = (1/q)(Σ_{i=1}^{q_o} x_i^o + Σ_{j=1}^{q_c} x_j^c), the mean vector of the positive-class samples is M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o, and the mean vector of the negative-class samples is M_c = (1/q_c) Σ_{j=1}^{q_c} x_j^c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
6. The image classification method according to claim 1, 2 or 3, characterized in that the dimensionality n of the high-dimensional asymmetric sample data and its total sample count q satisfy: n ≥ 3q.
7. The image classification method according to claim 6, characterized in that the high-dimensional asymmetric sample data is image data, gene expression data, or genome-wide association study data.
8. The image classification method according to claim 1, 2 or 3, characterized in that each data element in the high-dimensional asymmetric data is a real number.
CN201510251168.4A 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data Active CN105005783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510251168.4A CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510251168.4A CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Publications (2)

Publication Number Publication Date
CN105005783A CN105005783A (en) 2015-10-28
CN105005783B true CN105005783B (en) 2019-04-23

Family

ID=54378448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510251168.4A Active CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Country Status (1)

Country Link
CN (1) CN105005783B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980773B (en) * 2017-05-27 2023-06-06 重庆大学 Gas-liquid diphase detection system data fusion method based on artificial smell-taste technology
CN107392259B (en) * 2017-08-16 2021-12-07 北京京东尚科信息技术有限公司 Method and device for constructing unbalanced sample classification model
CN108106500B (en) * 2017-12-21 2020-01-14 中国舰船研究设计中心 Missile target type identification method based on multiple sensors
CN112685509B (en) * 2020-12-29 2022-08-02 通联数据股份公司 High-dimensional data collaborative change amplitude identification method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156885B (en) * 2010-02-12 2014-03-26 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
CN103035050B (en) * 2012-12-19 2015-05-20 南京师范大学 High-precision face recognition method for complex face recognition access control system
CN103218625A (en) * 2013-05-10 2013-07-24 陆嘉恒 Automatic remote sensing image interpretation method based on cost-sensitive support vector machine
CN103679132B (en) * 2013-07-15 2016-08-24 北京工业大学 A kind of nude picture detection method and system
CN103531205B (en) * 2013-10-09 2016-08-31 常州工学院 The asymmetrical voice conversion method mapped based on deep neural network feature
CN104616013A (en) * 2014-04-30 2015-05-13 北京大学 Method for acquiring low-dimensional local characteristics descriptor
CN103927530B (en) * 2014-05-05 2017-06-16 苏州大学 The preparation method and application process, system of a kind of final classification device

Also Published As

Publication number Publication date
CN105005783A (en) 2015-10-28

Similar Documents

Publication Publication Date Title
CN108564129B (en) Trajectory data classification method based on generative adversarial networks
CN108830209B (en) Remote sensing image road extraction method based on generative adversarial networks
Yi et al. Age estimation by multi-scale convolutional network
Menardi et al. Training and assessing classification rules with imbalanced data
CN109145921A (en) Image segmentation method based on improved intuitionistic fuzzy C-means clustering
CN108229298A (en) Neural network training and face recognition method and device, equipment, and storage medium
CN105005783B (en) Method for extracting classification information from high-dimensional asymmetric data
CN106778853A (en) Imbalanced data classification method based on weighted clustering and subsampling
WO2022126810A1 (en) Text clustering method
CN110188708A (en) Facial expression recognition method based on convolutional neural networks
Beikmohammadi et al. SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN
CN103914705A (en) Hyperspectral image classification and band selection method based on multi-objective immune cloning
Gragnaniello et al. Biologically-inspired dense local descriptor for indirect immunofluorescence image classification
CN114492768A (en) Twin capsule network intrusion detection method based on small sample learning
CN104156690A (en) Gesture recognition method based on image space pyramid bag of features
CN110059568A (en) Automatic multiclass leukocyte identification method based on deep convolutional neural networks
Çuğu et al. Treelogy: A novel tree classifier utilizing deep and hand-crafted representations
CN107578063B (en) Image spectral clustering based on fast landmark-point selection
CN102609715B (en) Object type identification method combining multiple interest-point detectors
US20240054639A1 (en) Quantification of conditions on biomedical images across staining modalities using a multi-task deep learning framework
Zhang et al. Discriminative tensor sparse coding for image classification.
Rhee Improvement feature vector: Autoregressive model of median filter residual
CN102609733B (en) Fast face recognition method in application environment of massive face database
CN108776809A (en) Dual-sampling ensemble classification model based on Fisher kernels
Sun et al. A compositional feature embedding and similarity metric for ultra-fine-grained visual categorization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant