CN105005783B - Method for extracting classification information from high-dimensional asymmetric data - Google Patents

Method for extracting classification information from high-dimensional asymmetric data Download PDF

Info

Publication number
CN105005783B
CN105005783B CN201510251168.4A CN201510251168A CN105005783B
Authority
CN
China
Prior art keywords
matrix
sample
dimension
data
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510251168.4A
Other languages
Chinese (zh)
Other versions
CN105005783A (en)
Inventor
刘丁赟
饶妮妮
刘汉明
郑洁
黎桑
曾伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201510251168.4A priority Critical patent/CN105005783B/en
Publication of CN105005783A publication Critical patent/CN105005783A/en
Application granted granted Critical
Publication of CN105005783B publication Critical patent/CN105005783B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The present invention relates to the field of signal and image processing, and provides a method for extracting classification information from high-dimensional asymmetric data, to solve the problems that existing related classification-information extraction methods either are unsuited to sample-asymmetric data, are computationally complex, or easily suffer computational overflow when processing high-dimensional data. The method comprises: obtaining high-dimensional asymmetric data; assigning new weights to Σ_o and Σ_c to form a new covariance matrix Σ_α that replaces Σ_t for eigendecomposition, and solving for its eigenvalues and eigenvectors; combining the eigenvectors into a dimension-reduction matrix, and projecting the high-dimensional asymmetric data through the dimension-reduction matrix to obtain the reduced-dimension classification information. The technical solution proposed by the present invention has low computational complexity, high accuracy, fast running speed, and good stability.

Description

Method for extracting classification information from high-dimensional asymmetric data
Technical field
The present invention relates to the field of signal and image processing, and in particular to a method for extracting classification information from high-dimensional asymmetric data.
Background art
Methods for extracting classification information from two-class sample data have highly important practical application value: for example, using the extracted classification information to distinguish face images from non-face images, diseased samples from non-diseased samples, or useful information from junk information. As the technologies and means for acquiring information become ever more advanced, the dimensionality of the two classes of data to be classified grows ever larger, and the two class sample sizes obtained are usually unbalanced, so that traditional two-class classification methods are severely limited. There is therefore an urgent need for a method that can extract classification information from high-dimensional data whose two class sample sizes are asymmetric, to meet the development needs of every field in today's information society.
Principal component analysis (PCA) is the most commonly used unsupervised multivariate statistical analysis method. It performs an eigenanalysis of the covariance matrix of a data set and isolates the main components of the data, as classification information, under the condition of minimizing reconstruction error. PCA has strong data-simplification capability and is easy to implement. However, when PCA faces unbalanced samples, although it maximizes the information reconstructed in the principal-component space, it cannot effectively retain the information that benefits classification, which degrades the classification performance of the whole application system. The main cause of PCA's misclassification is that, when the sample size of one class of data (called the positive class) is smaller than that of the other class (called the negative class), the eigenvectors corresponding to the small eigenvalues of the positive-class conditional covariance matrix deviate severely. To remedy this defect of PCA, an asymmetric PCA (Asymmetric Principal Component Analysis, APCA) method was proposed. APCA focuses on eliminating the factor that interferes with correct classification by PCA: it assigns new weights to the positive-class and negative-class conditional covariance matrices, forming a new covariance matrix that replaces PCA's total scatter matrix before eigendecomposition. Compared with the PCA method, APCA's ability to extract classification information from unbalanced data is greatly improved, but when processing high-dimensional data (such as some medical images) it frequently suffers from computational overflow. The reason is that the new covariance matrix constructed by APCA is a linear combination of several square matrices of size n × n, where n is the original dimensionality of the data. In many practical applications the original dimensionality is large; for example, an image of size 200 × 200 has 40000 pixels, i.e., 40000 dimensions. When computing the eigenvalues of such a high-dimensional covariance matrix, APCA therefore easily exhausts the calculator's memory so that subsequent computation cannot continue; even when the computation is possible, such a huge matrix dimension inevitably brings high computational complexity, and both computation time and error grow substantially.
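For a sense of scale, the following back-of-envelope sketch (illustrative only, not part of the original disclosure) estimates the memory that a single dense n × n covariance matrix requires at the 40000-dimension example above:

```python
# Memory needed to hold one dense n x n covariance matrix in float64
# for a 200 x 200 image (n = 40000 dimensions).
n = 200 * 200                               # 40000
bytes_per_entry = 8                         # float64
gib = n * n * bytes_per_entry / 2**30
print(f"{gib:.1f} GiB per n x n matrix")    # ~11.9 GiB; APCA combines several such matrices
```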
It can be seen that existing related classification-information extraction methods either are unsuited to sample-asymmetric data, or are computationally complex and easily suffer computational overflow when processing high-dimensional data.
Summary of the invention
[technical problems to be solved]
The purpose of the present invention is to solve the above drawbacks in the background art. By introducing joint-diagonalization theory, a method for extracting classification information from high-dimensional data with asymmetric two-class sample sizes is devised and embodied. For ease of description, the method provided by the invention for extracting classification information from high-dimensional asymmetric data is named Joint Diagonalization Principal Component Analysis (JDPCA).
[technical solution]
The present invention is achieved by the following technical solutions.
The present invention relates to a method for extracting classification information from high-dimensional asymmetric data, comprising the following steps:
Step A: obtain high-dimensional asymmetric data composed of positive samples and negative samples; by analysis, obtain the dimensionality n of the high-dimensional asymmetric data, the total sample count q of the high-dimensional asymmetric data, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; set the dimension m of the classification information to be extracted.
Step B: compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples; center the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c.
Step C: construct the matrices X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), where α_o = q_c/q and α_c = q_o/q.
Step D: compute the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {u_i^o} of X_o^T X_o, the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {u_j^c} of X_c^T X_c, the eigenvalue λ_mo and corresponding eigenvector u_mo of X_mo^T X_mo, and the eigenvalue λ_mc and corresponding eigenvector u_mc of X_mc^T X_mc.
Step E: piece together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors and eigenvalues computed in step D, and construct the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2}, where Λ^{1/2} denotes the entrywise square root of the diagonal matrix Λ.
Step F: compute the eigenvalues and corresponding eigenvectors of Σ̂_α; arrange the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {u^(k)}, where k = 1, 2, ... q; combine the eigenvectors corresponding to the first m eigenvalues in {u^(k)} into the dimension-reduction matrix Φ_m, and project the high-dimensional asymmetric data through Φ_m to obtain the reduced-dimension classification information.
As a preferred embodiment, step D specifically includes:
Step D1: compute the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {v_i^o} of X_o X_o^T, where i = 1, 2, ... q_o − 1; compute the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {v_j^c} of X_c X_c^T, where j = 1, 2, ... q_c − 1; compute the eigenvalue λ_mo and corresponding eigenvector v_mo of X_mo X_mo^T; compute the eigenvalue λ_mc and corresponding eigenvector v_mc of X_mc X_mc^T.
Step D2: compute the eigenvectors u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
As another preferred embodiment, the method in step F for computing the eigenvalues {λ^(k)} and corresponding eigenvectors {u^(k)} of Σ̂_α is:
compute the eigenvalues and corresponding eigenvectors of Σ̂_α, and arrange the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {v^(k)}, where k = 1, 2, ... q;
according to u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)), compute the eigenvector {u^(k)} of Σ_α corresponding to each eigenvalue {λ^(k)}.
As another preferred embodiment, the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
As another preferred embodiment, the mean vector of the high-dimensional asymmetric sample data is M = (1/q)(Σ_{i=1}^{q_o} x_i^o + Σ_{j=1}^{q_c} x_j^c), the mean vector of the positive-class samples is M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o, and the mean vector of the negative-class samples is M_c = (1/q_c) Σ_{j=1}^{q_c} x_j^c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
As another preferred embodiment, the dimensionality n of the high-dimensional asymmetric sample data and its total sample count q satisfy: n ≥ 3q.
As another preferred embodiment, the high-dimensional asymmetric sample data is image data, gene expression data, or genome-wide association study data.
As another preferred embodiment, each data element in the high-dimensional asymmetric data is a real number.
The technical solution of the present invention is described in detail below.
The present invention is directed at high-dimensional asymmetric data composed of positive samples and negative samples. Specifically, by analyzing the acquired data, the dimensionality n of the high-dimensional asymmetric data, its total sample count q, the sample count q_o of the positive samples, and the sample count q_c of the negative samples are obtained, with q = q_o + q_c. Because the data are asymmetric, q_o ≠ q_c; because the data are high-dimensional, n >> q, where ">>" denotes "much larger than". In general, the dimensionality n of the high-dimensional asymmetric data should be at least 3 times its total sample count q, i.e., n ≥ 3q.
Specifically, the positive sample set χ_o in the high-dimensional asymmetric data consists of q_o samples, each represented as a row vector, where the subscript o denotes the positive class and x_i^o denotes the i-th positive sample, i = 1, 2, ... q_o. The mean vector M_o of χ_o is:
M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o (1)
and the class conditional covariance matrix Σ_o is:
Σ_o = (1/q_o) Σ_{i=1}^{q_o} (x_i^o − M_o)^T (x_i^o − M_o) (2)
The negative sample set χ_c in the high-dimensional asymmetric data consists of q_c samples, each represented as a row vector, where the subscript c denotes the negative class and x_j^c denotes the j-th negative sample, j = 1, 2, ... q_c. The mean vector M_c and the class conditional covariance matrix Σ_c of χ_c are obtained in the same way.
The high-dimensional asymmetric data is the union of the two sample sets, χ = χ_o ∪ χ_c. Its mean vector M is obtained in the same way, and the centered high-dimensional asymmetric data X is obtained from the mean vector M; specifically, each row of X is the corresponding sample of χ minus M.
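For illustration, the quantities defined above can be sketched in NumPy as follows; the sample counts and synthetic data are invented for the example, and samples are stored as row vectors as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
q_o, q_c, n = 30, 70, 600                      # invented sizes, q = 100 << n
chi_o = rng.normal(size=(q_o, n))              # positive sample set, q_o rows
chi_c = rng.normal(0.3, 1.0, size=(q_c, n))    # negative sample set, q_c rows

M_o = chi_o.mean(axis=0)                       # positive-class mean vector M_o
M_c = chi_c.mean(axis=0)                       # negative-class mean vector M_c
chi = np.vstack([chi_o, chi_c])                # union of the two classes
M = chi.mean(axis=0)                           # overall mean vector M
X = chi - M                                    # centered high-dimensional data
# Class conditional covariance (n x n; formed here only because n is small):
Sigma_o = (chi_o - M_o).T @ (chi_o - M_o) / q_o
```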
PCA in the prior art uses the smaller matrix X X^T to perform the eigendecomposition of the total scatter matrix Σ_t indirectly; the eigenvectors corresponding to the m largest eigenvalues form the dimension-reduction matrix, and any n-dimensional data point projected through that matrix is reduced to m dimensions. The total scatter matrix Σ_t is given by formula (3) and the between-class scatter matrix Σ_m by formula (4):
Σ_t = (1/q) X^T X (3)
Σ_m = (q_o/q)(M_o − M)^T (M_o − M) + (q_c/q)(M_c − M)^T (M_c − M) (4)
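The q × q shortcut described above can be sketched as follows (an illustrative rendering, not code from the patent): the eigendecomposition is done on X X^T instead of the n × n total scatter matrix, and the eigenvectors are lifted back to n dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
q, n, m = 100, 5000, 10
X = rng.normal(size=(q, n))
X -= X.mean(axis=0)                         # center the data

G = X @ X.T                                 # q x q Gram matrix, cheap compared with n x n
lam, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]              # sort descending
keep = lam > 1e-10                          # drop the zero eigenvalue caused by centering
U = X.T @ V[:, keep] / np.sqrt(lam[keep])   # unit eigenvectors of X^T X, n x (q-1)
Phi_m = U[:, :m]                            # dimension-reduction matrix, n x m
Y = X @ Phi_m                               # data reduced from n to m dimensions
```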
The JDPCA method provided by the invention assigns new weights to Σ_o and Σ_c, forming a new covariance matrix Σ_α that replaces Σ_t for eigendecomposition, and solves for its eigenvalues and eigenvectors. The covariance matrix Σ_α is given by formula (5):
Σ_α = α_o Σ_o + α_c Σ_c + Σ_m (5)
Because the weights of the two class conditional covariance matrices in Σ_α become α_o = q_c/q and α_c = q_o/q, they are no longer the estimated prior probabilities of the two classes, so a matrix X' satisfying Σ_α = (1/q) X'^T X' cannot be obtained directly from the centered high-dimensional asymmetric data X as in PCA. To eigendecompose Σ_α nevertheless, the present invention finds a matrix U that can simultaneously diagonalize all the matrices composing Σ_α; after the diagonalization is achieved, the matrix Σ̂_α is constructed from U and the resulting diagonal matrix. Since the entire calculation process generates no matrix as large as n × n, the computational complexity of JDPCA is substantially reduced. However, existing joint-diagonalization methods are approximate algorithms that usually require iteration or inversion; if JDPCA adopted them directly, not only would the extracted information be distorted, but the computational load would also grow. The present invention therefore skillfully exploits the low-rank and real-symmetric properties of the above covariance matrices and designs a fast and accurate new non-orthogonal joint-diagonalization algorithm to find the matrix U, so that JDPCA does not run into the curse of dimensionality when processing high-dimensional data.
The conventional joint-diagonalization problem can be described in the following form: for L matrices A_1, A_2, ... A_L of size n × n, find one diagonalizing matrix U and L corresponding diagonal matrices Λ_1, Λ_2, ... Λ_L such that A_l = U Λ_l U^H holds for every l ∈ {1, 2, 3, ... L}. Since the matrices to be jointly diagonalized in the present invention are real symmetric, the conjugate transpose "H" is written throughout as the transpose "T". According to formulas (3), (4), and (5), let Σ_mo = α_c (M_o − M)^T (M_o − M) and Σ_mc = α_o (M_c − M)^T (M_c − M); then Σ_α can be expressed as:
Σ_α = α_o Σ_o + α_c Σ_c + Σ_mo + Σ_mc (6)
It is an object of the present invention to find one matrix U and four diagonal matrices Λ_o, Λ_c, Λ_mo, Λ_mc such that U simultaneously diagonalizes the four square matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc into the corresponding diagonal matrices, so that Σ_α can be decomposed into the following form:
Σ_α = U Λ_o U^T + U Λ_c U^T + U Λ_mo U^T + U Λ_mc U^T = U (Λ_o + Λ_c + Λ_mo + Λ_mc) U^T (7)
The matrix U (Λ_o + Λ_c + Λ_mo + Λ_mc) U^T in formula (7) is then exactly the decomposition of Σ_α sought by the invention, from which the matrix Σ̂_α is built. The four diagonal matrices can be constructed from the eigenvalues of the corresponding original square matrices; the difficulty lies in computing the diagonalizing matrix U. The matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc are all real symmetric of size n × n, and when q << n their ranks are q_o − 1, q_c − 1, 1, and 1 respectively, much smaller than the dimension n. For a general real symmetric matrix A_l with diagonalization A_l = U_l Λ_l U_l^T, this transformation has the following properties:
(a) Changing the columns of the diagonalizing matrix U_l that correspond to zero eigenvalues in Λ_l leaves the equation valid.
(b) Simultaneously exchanging a pair of eigenvalues in Λ_l and the positions of their corresponding eigenvectors in U_l leaves the equation valid.
(c) Directly deleting zero eigenvalues in Λ_l together with the corresponding eigenvector columns in U_l leaves the equation valid.
(d) Appending zero eigenvalues after the original eigenvalues in Λ_l and adding zero vectors at the corresponding positions in U_l leaves the equation valid.
Using the above properties together with the singular value decomposition theorem, all eigenvalues and eigenvectors of α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc can be found; exploiting their low-rank property, the positions of these eigenvalues and eigenvectors are then rearranged, and the eigenvector columns of one matrix that correspond to part of its zero eigenvalues are replaced by the eigenvectors corresponding to the nonzero eigenvalues of the other three matrices, so that a matrix U satisfying the condition can finally be pieced together. After the matrix Σ̂_α is constructed from U and the diagonalized diagonal matrix, the dimension-reduction matrix Φ_m is built from it, and the high-dimensional asymmetric data is projected through Φ_m to obtain the reduced-dimension classification information.
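A tiny numeric check of properties (a) and (d), illustrative rather than taken from the patent, shows why the columns of U paired with zero eigenvalues can be chosen freely:

```python
import numpy as np

rng = np.random.default_rng(2)
u1 = rng.normal(size=(5, 1))
u1 /= np.linalg.norm(u1)
A = 2.0 * (u1 @ u1.T)                  # rank-1 real symmetric matrix, eigenvalue 2

pad = rng.normal(size=(5, 3))          # arbitrary columns paired with zero eigenvalues
U = np.hstack([u1, pad])               # 5 x 4
Lam = np.diag([2.0, 0.0, 0.0, 0.0])    # 4 x 4
print(np.allclose(U @ Lam @ U.T, A))   # True, whatever 'pad' contains
```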
[beneficial effect]
Compared with the prior art, the technical solution proposed by the present invention has the following advantages:
(1) When reducing the dimension of two classes of samples with unbalanced sizes, the redundant information of the minority class is usually less stable than that of the majority class; if this redundancy is not rejected thoroughly enough, severe over-fitting occurs during classification. The majority class, having more training samples, has more reliable and stable redundant information, part of which contains credible between-class difference information. The present invention therefore increases the strength with which the unstable redundant information of the minority class is rejected and decreases the strength with which the redundant information of the majority class is rejected. The principal components retained by the invention are those that best embody the difference between the two classes. Consequently, for two classes of sample data with unbalanced sizes, each principal component extracted by the invention distinguishes the two classes more clearly than traditional PCA, and these principal components remain orthogonal and mutually uncorrelated.
(2) Because the entire calculation process of the invention neither generates nor operates on any matrix as large as n × n (where n is the original dimensionality of the data), the computational complexity of the invention is greatly reduced. When processing high-dimensional two-class sample data, the invention computes with high accuracy, runs fast, and is stable.
Description of the drawings
Fig. 1 is a flowchart of the method for extracting classification information from high-dimensional asymmetric data provided by an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described below clearly and completely with reference to the accompanying drawing. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them, and do not limit the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the protection scope of the invention.
Fig. 1 is a flowchart of the method for extracting classification information from high-dimensional asymmetric data provided by embodiment one of the present invention. As shown in Fig. 1, the method comprises steps S11 to S19, each described in detail below.
Step S11: higher-dimension asymmetric data is obtained.
Specifically, the high-dimensional asymmetric data is composed of positive samples and negative samples. Analysis yields the dimensionality n of the high-dimensional asymmetric data, its total sample count q, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; the dimension m of the classification information to be extracted is set. In this embodiment, x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
Step S12: compute the mean vectors, and compute the centered positive sample set matrix and negative sample set matrix.
Specifically, compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples, and center the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c. The computation of each mean vector is described in the summary of the invention. In this embodiment, the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c.
Step S13: construct the matrices X_o, X_c, X_mo, and X_mc respectively.
Specifically, X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), with α_o = q_c/q and α_c = q_o/q. The matrix X_o has size q_o × n and obviously satisfies X_o^T X_o = α_o Σ_o; similarly, the matrix X_c satisfies X_c^T X_c = α_c Σ_c. The matrix X_mo has size 1 × n and obviously satisfies X_mo^T X_mo = Σ_mo; similarly, the matrix X_mc satisfies X_mc^T X_mc = Σ_mc.
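For illustration, the following NumPy sketch starts a running example of steps S11 to S13; all sizes and data are invented for the example, and the snippets after steps S15, S16, S18, and S19 continue from it:

```python
import numpy as np

# Steps S11-S12: synthetic asymmetric data, samples as rows.
rng = np.random.default_rng(3)
q_o, q_c, n, m = 20, 80, 1000, 10              # asymmetric classes, n >= 3q
q = q_o + q_c
chi_o = rng.normal(size=(q_o, n))              # positive samples
chi_c = rng.normal(0.2, 1.0, size=(q_c, n))    # negative samples

M_o, M_c = chi_o.mean(axis=0), chi_c.mean(axis=0)
M = np.vstack([chi_o, chi_c]).mean(axis=0)
S_o, S_c = chi_o - M_o, chi_c - M_c            # centered class set matrices

# Step S13: the four matrices, with the swapped-prior weights.
alpha_o, alpha_c = q_c / q, q_o / q
X_o = np.sqrt(alpha_o / q_o) * S_o             # X_o^T X_o = alpha_o * Sigma_o
X_c = np.sqrt(alpha_c / q_c) * S_c             # X_c^T X_c = alpha_c * Sigma_c
X_mo = np.sqrt(alpha_c) * (M_o - M)[None, :]   # 1 x n, X_mo^T X_mo = Sigma_mo
X_mc = np.sqrt(alpha_o) * (M_c - M)[None, :]   # 1 x n, X_mc^T X_mc = Sigma_mc
```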
Step S14: compute the nonzero eigenvalues and corresponding eigenvectors of the matrices X_o X_o^T, X_c X_c^T, X_mo X_mo^T, and X_mc X_mc^T respectively.
Specifically, the matrix X_o X_o^T has size q_o × q_o; compute its nonzero eigenvalues {λ_i^o} and the corresponding eigenvectors {v_i^o}, where λ_i^o denotes the i-th nonzero eigenvalue of X_o X_o^T, v_i^o denotes the corresponding eigenvector, and i = 1, 2, ... q_o − 1. The matrix X_c X_c^T has size q_c × q_c; compute its nonzero eigenvalues {λ_j^c} and the corresponding eigenvectors {v_j^c}, where j = 1, 2, ... q_c − 1. The matrix X_mo X_mo^T has size 1 × 1, so its eigenvalue is itself, denoted λ_mo, with the corresponding eigenvector denoted v_mo. The matrix X_mc X_mc^T likewise has size 1 × 1, with eigenvalue λ_mc and corresponding eigenvector v_mc.
Step S15: solve for the nonzero eigenvalues and corresponding eigenvectors of X_o^T X_o, X_c^T X_c, X_mo^T X_mo, and X_mc^T X_mc.
Step S15 uses the singular value decomposition theorem: X_o^T X_o and X_o X_o^T share the same nonzero eigenvalues, and their eigenvectors stand in a fixed correspondence, which allows the eigenvectors of X_o^T X_o to be solved; the nonzero eigenvalues and corresponding eigenvectors of X_c^T X_c, X_mo^T X_mo, and X_mc^T X_mc are solved in the same way. Specifically, the eigenvectors are computed as u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
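Continuing the running sketch, steps S14 and S15 reduce to eigendecompositions of small matrices followed by the lift above; the helper name small_eig is invented for the example:

```python
# ...continues the running sketch started at step S13.
def small_eig(X, r):
    """Nonzero eigenpairs of X @ X.T (small), lifted to X.T @ X via the SVD relation."""
    lam, V = np.linalg.eigh(X @ X.T)             # ascending order
    lam, V = lam[::-1][:r], V[:, ::-1][:, :r]    # keep the r nonzero eigenvalues
    U = (X.T @ V) / np.sqrt(lam)                 # u_i = X^T v_i / sqrt(lam_i), unit columns
    return lam, U

lam_o, U_o = small_eig(X_o, q_o - 1)   # rank q_o - 1: centering removes one eigenvalue
lam_c, U_c = small_eig(X_c, q_c - 1)
lam_mo, U_mo = small_eig(X_mo, 1)      # 1 x 1 case: the eigenvalue is X_mo @ X_mo.T itself
lam_mc, U_mc = small_eig(X_mc, 1)
```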
Step S16: piece together the diagonalizing matrix U and the diagonal matrix Λ.
In step S16, four diagonalizing matrices of size n × q and four diagonal matrices of size q × q are first constructed: U_o places the eigenvectors u_1^o, ..., u_{q_o−1}^o in its first q_o − 1 columns and zero vectors in the remaining columns, with Λ_o = diag(λ_1^o, ..., λ_{q_o−1}^o, 0, ..., 0); U_c places u_1^c, ..., u_{q_c−1}^c in the next q_c − 1 columns, with the matching diagonal matrix Λ_c; U_mo and U_mc place u_mo and u_mc in the last two columns, with the matching Λ_mo and Λ_mc.
According to properties (b), (c), and (d) of the diagonalization algorithm, it is easy to verify that these satisfy α_o Σ_o = U_o Λ_o U_o^T, α_c Σ_c = U_c Λ_c U_c^T, Σ_mo = U_mo Λ_mo U_mo^T, and Σ_mc = U_mc Λ_mc U_mc^T.
The diagonalizing matrix U and the diagonal matrix Λ are then obtained by combination: U = U_o + U_c + U_mo + U_mc, i.e., U = [u_1^o, ..., u_{q_o−1}^o, u_1^c, ..., u_{q_c−1}^c, u_mo, u_mc], and Λ = Λ_o + Λ_c + Λ_mo + Λ_mc = diag(λ_1^o, ..., λ_{q_o−1}^o, λ_1^c, ..., λ_{q_c−1}^c, λ_mo, λ_mc).
According to property (a) of the diagonalization algorithm, it can be verified that U jointly diagonalizes the four square matrices α_o Σ_o, α_c Σ_c, Σ_mo, and Σ_mc that compose Σ_α, and the diagonalization results are exactly Λ_o, Λ_c, Λ_mo, and Λ_mc.
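Continuing the running sketch, step S16 is pure bookkeeping, and the joint-diagonalization claim can be checked numerically here because the example's n is modest:

```python
# ...continues the running sketch (step S16).
U = np.hstack([U_o, U_c, U_mo, U_mc])                  # (q_o-1)+(q_c-1)+1+1 = q columns
lam_all = np.concatenate([lam_o, lam_c, lam_mo, lam_mc])
Lam = np.diag(lam_all)

# Optional check: U @ Lam @ U.T reproduces Sigma_alpha without JDPCA ever forming it.
Sigma_alpha = X_o.T @ X_o + X_c.T @ X_c + X_mo.T @ X_mo + X_mc.T @ X_mc
print(np.allclose(U @ Lam @ U.T, Sigma_alpha))         # True
```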
Step S17: construct the matrix Σ̂_α and compute its eigenvalues and corresponding eigenvectors.
Construct the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2}. From formula (7) it follows that the nonzero eigenvalues of Σ̂_α are exactly those of Σ_α = U Λ U^T. Compute the eigenvalues {λ^(k)} and the corresponding eigenvectors {v^(k)} of Σ̂_α, with the eigenvalues {λ^(k)} arranged from largest to smallest, where k = 1, 2, ... q.
Step S18: compute the eigenvalues and corresponding eigenvectors of Σ_α.
Specifically, according to the singular value decomposition theorem, the eigenvector of Σ_α corresponding to each eigenvalue {λ^(k)} is computed by the formula u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)).
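Continuing the running sketch, steps S17 and S18 operate only on q × q matrices; the symmetric form Λ^{1/2} U^T U Λ^{1/2} for Σ̂_α is the one assumed above:

```python
# ...continues the running sketch (steps S17-S18).
A = U * np.sqrt(lam_all)                  # A = U @ Lam^{1/2}, n x q
Sigma_hat = A.T @ A                       # Lam^{1/2} U^T U Lam^{1/2}, q x q
lam_k, V_k = np.linalg.eigh(Sigma_hat)    # shares the nonzero eigenvalues of Sigma_alpha
lam_k, V_k = lam_k[::-1], V_k[:, ::-1]    # arrange from largest to smallest
U_k = (A @ V_k) / np.sqrt(lam_k)          # u(k) = U Lam^{1/2} v(k) / sqrt(lam(k))
```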
Step S19: combine to obtain the dimension-reduction matrix, and project the high-dimensional asymmetric data through it to obtain the reduced-dimension classification information.
Specifically, the eigenvectors corresponding to the first m eigenvalues in {u^(k)} are combined into the dimension-reduction matrix Φ_m, and the high-dimensional asymmetric data obtained in step S11 is projected through Φ_m to obtain the reduced-dimension classification information.
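Continuing the running sketch, step S19 completes the example:

```python
# ...continues the running sketch (step S19).
Phi_m = U_k[:, :m]                        # dimension-reduction matrix, n x m
Y = np.vstack([chi_o, chi_c]) @ Phi_m     # classification information, q x m
print(Y.shape)                            # (100, 10)
```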
Classification-information extraction experiments on two-class sample data using the method provided by the embodiment of the present invention are described below. To verify and compare the performance of the embodiment from the two aspects of different dimension scales and different two-class sample proportions, two groups of data were used, referred to as group A and group B. Group A verifies the accuracy and running speed of the embodiment on data of different dimensionality; group B verifies the class-resolution capability of the embodiment on unbalanced samples. Each group contains positive and negative samples, and the experimental data are described as follows.
Group A data: generated to verify the embodiment's performance in extracting classification information from data of different dimensionality. In group A, the positive and negative sample counts are both set to 500, so the total sample count is 1000. The mean of every dimension of the positive samples is 0, and the variance of the i-th dimension is 1/i^0.5. The means of the negative samples are nonzero and differ across dimensions: the mean of the j-th dimension is 1/(8j)^0.25 and its variance is 1/(50j)^0.25. The two classes were designed this way for the following reasons: (1) the means and variances of the two classes both differ, guaranteeing that the differences are comprehensive; (2) dimensions where the class means differ greatly also differ greatly in variance (concentrated mainly in the first 20 dimensions), and dimensions with small mean differences have small variance differences, i.e., the mean and variance differences show no overly obvious separation trend over the whole dimension range, so that every dimension contributes to correct classification and classification accuracy also grows as the total dimensionality grows; (3) on any single dimension the mean and variance differences of the two classes are small, so relying on one or a few dimensions cannot separate the two classes; each method can thus attain a certain accuracy when discriminating the two classes, but cannot easily reach 100%. The total dimensionality n of group A rises from n = 1500 to n = 10000 in steps of 500, yielding multiple data sets with identical properties but different dimensionality.
Group B data: face and non-face images downloaded from the MIT face image database, of which 1000 were selected for the experiment. In the experiment, the total sample count (1000) is fixed, and the proportion of face-image (positive) samples participating in training is changed from 50% to 5%, i.e., gradually reduced from 450 to 45. The two class sample counts thus change from a balanced to an unbalanced state, yielding multiple data sets with identical dimensionality but different class sample counts.
After extracting classification information from the above data with the embodiment of the present invention, an improved support vector machine (ODR-BSMOTE-SVM, abbreviated OB-SVM) classifies the samples according to the extracted classification information. In the OB-SVM classifier, the kernel function is fixed as the Gaussian kernel, and the balance parameter of ODR and BSMOTE is taken as α = 0.9. All experiments are verified by ten-fold cross-validation, and the classification results are assessed by average sensitivity (Sen), specificity (Spe), and accuracy (Acc). Let FP be the number of negative samples misclassified as positive and FN the number of positive samples misclassified as negative; TP and TN denote the numbers of correctly classified positive and negative samples respectively. Sensitivity, specificity, and accuracy are then defined as follows.
Sensitivity = TP / (TP + FN)
Specificity = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
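For clarity, the three figures of merit can be computed directly; the counts below are invented for the example:

```python
def sen_spe_acc(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

print(sen_spe_acc(tp=40, tn=45, fp=5, fn=10))   # (0.8, 0.9, 0.85)
```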
Experiment one: performance and running-speed verification at different dimensionalities.
Experiment one uses the group A data. Because of ten-fold validation, the total number of training samples each time is q = 900 (450 each of positive and negative samples), and there are 100 test samples (50 of each class). The total dimensionality rises from n = 1500 to n = 10000 in steps of 500. On the group A data of each dimensionality, JDPCA, APCA, and PCA are executed respectively, followed by OB-SVM classification. The dimension-reduction parameter m is fixed at 50; note that the dimension-reduction parameter m is the dimension m of the classification information to be extracted. The average classification performance and computation time obtained by ten-fold cross-validation at each dimensionality are shown in Table 1. Since q_o = q_c = 450, the two class sample sizes are balanced, so the covariance matrix Σ_α obtained by JDPCA and APCA does not differ from the total scatter matrix Σ_t obtained by PCA in its eigendecomposition result, and the average sensitivity, specificity, accuracy, and AUC of the three methods are identical. Table 1 therefore lists these performance values only once. The last three columns of Table 1 give the average training times the three methods require on group A data of each dimensionality; OM denotes out of memory (Out of Memory, OM), i.e., the computation cannot continue and the method is forced to abort.
Table 1. Classification performance and computation time on data of different dimensionalities, based on JDPCA, APCA, and PCA classification information
As Table 1 shows, when the total dimensionality n rises from 1500 to 10000, the accuracies of the three methods remain identical at each dimensionality, slowly rising from 91.7% to 100% as the dimensionality increases. The running times of the three methods, however, differ greatly. The running times of PCA and JDPCA grow linearly: for every 500 added dimensions, PCA's running time grows by about 1.2 s on average and JDPCA's by about 5 s. APCA's running time grows quadratically, and memory overflow already occurs when n reaches 9500. Because the embodiment of the invention involves more intermediate variables and corresponding operations, JDPCA shows no running-time advantage over APCA at low dimensionality, but once the dimensionality exceeds 2500 its speed advantage becomes evident. Facing high-dimensional data, APCA's curse-of-dimensionality problem becomes severe and its computational complexity rises steeply, so that when the data dimensionality reaches 9500 memory overflows and the computation fails, whereas the JDPCA proposed by the invention has no such problem. This is because the joint-diagonalization algorithm designed in the embodiment avoids generating and computing large n × n matrices (where n is the original dimensionality of the data) and involves no complex operations such as inversion or iteration, so the eigenvalues and eigenvectors of Σ_α are computed more quickly and accurately. For larger data, APCA's computational complexity grows ever greater, making its running speed too slow or its operation outright impossible, which cannot meet the current need of every field to process big data. Although the invention is not as fast as PCA, its running time is greatly reduced compared with APCA.
Experiment two: class-resolution verification on unbalanced data.
Experiment two uses the group B data. To eliminate the influence of computational-complexity differences across dimensionalities on this experiment, all images are standardized to 45 × 45 (n = 2025). The total number of training samples is q = 900. The face-image sample proportion (P) is changed from 50% to 5%, i.e., gradually reduced from 450 to 45. The dimension-reduction parameter m is fixed at 50. The ten-fold cross-validation results are shown in Table 2.
Table 2. Classification performance at different face-image proportions, based on JDPCA, APCA, and PCA classification information
As Table 2 shows, on the unbalanced MIT face data the specificities and accuracies of the three methods JDPCA, APCA, and PCA differ little, and JDPCA and APCA give no different final results because the theoretical values of their computations are identical. The difference among the three lies mainly in sensitivity. Clearly, as the face sample proportion gradually decreases, PCA's sensitivity drops substantially, and its drop is larger than that of JDPCA and APCA. When face samples account for only 5% of the population, PCA's sensitivity is up to 6% lower than that of JDPCA and APCA. It follows that the JDPCA provided by the embodiment of the invention retains APCA's class-resolution advantage on unbalanced data sets.
Therefore, the JDPCA proposed by the present invention combines fast running speed with high accuracy when extracting classification information from high-dimensional data with unequal sample sizes, and has extensive practical application value in fields such as communications, radar, and biomedical signal/image processing.
As the above embodiments and related experiments show, the embodiment of the present invention increases the strength with which the unstable redundant information of the minority class is rejected and decreases the strength with which the redundant information of the majority class is rejected; in addition, the principal components retained by the embodiment are those that best embody the difference between the two classes. Therefore, for two classes of sample data with unbalanced sizes, each principal component extracted by the embodiment distinguishes the two classes more clearly than traditional PCA, and these principal components remain orthogonal and mutually uncorrelated. Moreover, because the entire calculation process of the embodiment neither generates nor operates on any matrix as large as n × n (where n is the original dimensionality of the data), the computational complexity of the embodiment is greatly reduced. When processing high-dimensional two-class sample data, the embodiment computes with high accuracy, runs fast, and is stable, whereas the traditional APCA method has larger computation error and is prone to computational overflow.

Claims (8)

1. An image classification method applied to the situation of asymmetric samples, characterized by comprising the following steps:
inputting images to be classified, the images to be classified comprising asymmetric positive samples and negative samples, and extracting classification information from the positive samples and negative samples according to a joint diagonalization principal component analysis method, comprising:
Step A: obtaining high-dimensional asymmetric data composed of the positive samples and the negative samples, and analyzing to obtain the dimensionality n of the high-dimensional asymmetric data, the total sample count q of the high-dimensional asymmetric data, the sample count q_o of the positive samples, and the sample count q_c of the negative samples; setting the dimension m of the classification information to be extracted;
Step B: computing the mean vector M of the high-dimensional asymmetric sample data, the mean vector M_o of the positive-class samples, and the mean vector M_c of the negative-class samples, and centering the positive samples and negative samples respectively to obtain the centered positive sample set matrix S_o and the centered negative sample set matrix S_c;
Step C: constructing the matrices X_o = sqrt(α_o/q_o) S_o, X_c = sqrt(α_c/q_c) S_c, X_mo = sqrt(α_c)(M_o − M), and X_mc = sqrt(α_o)(M_c − M), where α_o = q_c/q and α_c = q_o/q; Step D: computing the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {u_i^o} of X_o^T X_o, the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {u_j^c} of X_c^T X_c, the eigenvalue λ_mo and corresponding eigenvector u_mo of X_mo^T X_mo, and the eigenvalue λ_mc and corresponding eigenvector u_mc of X_mc^T X_mc;
Step E: piecing together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors obtained in step D, and constructing the matrix Σ̂_α = Λ^{1/2} U^T U Λ^{1/2};
Step F: computing the eigenvalues and corresponding eigenvectors of Σ̂_α, arranging the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {u^(k)}, where k = 1, 2, ... q, combining the eigenvectors corresponding to the first m eigenvalues in {u^(k)} into the dimension-reduction matrix Φ_m, and projecting the high-dimensional asymmetric data through Φ_m to obtain the reduced-dimension classification information;
classifying the images to be classified according to the classification information;
wherein step D specifically comprises:
Step D1: computing the nonzero eigenvalues {λ_i^o} and corresponding eigenvectors {v_i^o} of X_o X_o^T, where i = 1, 2, ... q_o − 1; computing the nonzero eigenvalues {λ_j^c} and corresponding eigenvectors {v_j^c} of X_c X_c^T, where j = 1, 2, ... q_c − 1; computing the eigenvalue λ_mo and corresponding eigenvector v_mo of X_mo X_mo^T; and computing the eigenvalue λ_mc and corresponding eigenvector v_mc of X_mc X_mc^T;
Step D2: computing the eigenvectors u_i^o = X_o^T v_i^o / sqrt(λ_i^o), u_j^c = X_c^T v_j^c / sqrt(λ_j^c), u_mo = X_mo^T v_mo / sqrt(λ_mo), and u_mc = X_mc^T v_mc / sqrt(λ_mc).
2. The image classification method according to claim 1, characterized in that:
the images to be classified are face and non-face images, or diseased samples and non-diseased samples.
3. The image classification method according to claim 1, characterized in that the method in step F for computing the eigenvalues {λ^(k)} and corresponding eigenvectors {u^(k)} of Σ̂_α is:
computing the eigenvalues and corresponding eigenvectors of Σ̂_α, and arranging the eigenvalues of Σ̂_α from largest to smallest to obtain {λ^(k)} and the corresponding eigenvectors {v^(k)}, where k = 1, 2, ... q;
computing, according to u^(k) = U Λ^{1/2} v^(k) / sqrt(λ^(k)), the eigenvector {u^(k)} of Σ_α corresponding to each eigenvalue {λ^(k)}.
4. The image classification method according to claim 1, 2 or 3, characterized in that: the centered positive sample set matrix S_o is the q_o × n matrix whose i-th row is x_i^o − M_o, and the centered negative sample set matrix S_c is the q_c × n matrix whose j-th row is x_j^c − M_c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
5. The image classification method according to claim 1, 2 or 3, characterized in that: the mean vector of the high-dimensional asymmetric sample data is M = (1/q)(Σ_{i=1}^{q_o} x_i^o + Σ_{j=1}^{q_c} x_j^c), the mean vector of the positive-class samples is M_o = (1/q_o) Σ_{i=1}^{q_o} x_i^o, and the mean vector of the negative-class samples is M_c = (1/q_c) Σ_{j=1}^{q_c} x_j^c, where x_i^o denotes a positive-class sample, i = 1, 2, ... q_o, and x_j^c denotes a negative-class sample, j = 1, 2, ... q_c.
6. The image classification method according to claim 1, 2 or 3, characterized in that the dimensionality n of the high-dimensional asymmetric sample data and its total sample count q satisfy: n ≥ 3q.
7. The image classification method according to claim 6, characterized in that the high-dimensional asymmetric sample data is image data, gene expression data, or genome-wide association study data.
8. The image classification method according to claim 1, 2 or 3, characterized in that each data element in the high-dimensional asymmetric data is a real number.
CN201510251168.4A 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data Active CN105005783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510251168.4A CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510251168.4A CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Publications (2)

Publication Number Publication Date
CN105005783A CN105005783A (en) 2015-10-28
CN105005783B true CN105005783B (en) 2019-04-23

Family

ID=54378448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510251168.4A Active CN105005783B (en) 2015-05-18 2015-05-18 Method for extracting classification information from high-dimensional asymmetric data

Country Status (1)

Country Link
CN (1) CN105005783B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980773B (en) * 2017-05-27 2023-06-06 重庆大学 Gas-liquid diphase detection system data fusion method based on artificial smell-taste technology
CN107392259B (en) * 2017-08-16 2021-12-07 北京京东尚科信息技术有限公司 Method and device for constructing unbalanced sample classification model
CN108106500B (en) * 2017-12-21 2020-01-14 中国舰船研究设计中心 Missile target type identification method based on multiple sensors
CN112685509B (en) * 2020-12-29 2022-08-02 通联数据股份公司 High-dimensional data collaborative change amplitude identification method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156885B (en) * 2010-02-12 2014-03-26 中国科学院自动化研究所 Image classification method based on cascaded codebook generation
CN103035050B (en) * 2012-12-19 2015-05-20 南京师范大学 High-precision face recognition method for complex face recognition access control system
CN103218625A (en) * 2013-05-10 2013-07-24 陆嘉恒 Automatic remote sensing image interpretation method based on cost-sensitive support vector machine
CN103679132B (en) * 2013-07-15 2016-08-24 北京工业大学 A kind of nude picture detection method and system
CN103531205B (en) * 2013-10-09 2016-08-31 常州工学院 The asymmetrical voice conversion method mapped based on deep neural network feature
CN104616013A (en) * 2014-04-30 2015-05-13 北京大学 Method for acquiring low-dimensional local characteristics descriptor
CN103927530B (en) * 2014-05-05 2017-06-16 苏州大学 The preparation method and application process, system of a kind of final classification device

Also Published As

Publication number Publication date
CN105005783A (en) 2015-10-28

Similar Documents

Publication Publication Date Title
CN108564129B (en) Trajectory data classification method based on generative adversarial networks
CN108830209B (en) Remote sensing image road extraction method based on generative adversarial networks
Yi et al. Age estimation by multi-scale convolutional network
Menardi et al. Training and assessing classification rules with imbalanced data
CN109145921A (en) Image segmentation method based on improved intuitionistic fuzzy C-means clustering
CN108229298A (en) Neural network training and face recognition method and device, equipment, and storage medium
CN105005783B (en) Method for extracting classification information from high-dimensional asymmetric data
CN106778853A (en) Imbalanced data classification method based on weighted clustering and subsampling
WO2022126810A1 (en) Text clustering method
CN110188708A (en) Facial expression recognition method based on convolutional neural networks
Beikmohammadi et al. SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN
CN103914705A (en) Hyperspectral image classification and band selection method based on multi-objective immune cloning
Gragnaniello et al. Biologically-inspired dense local descriptor for indirect immunofluorescence image classification
CN114492768A (en) Twin capsule network intrusion detection method based on small sample learning
CN104156690A (en) Gesture recognition method based on image space pyramid bag of features
CN110059568A (en) Automatic multiclass leukocyte identification method based on deep convolutional neural networks
Çuğu et al. Treelogy: A novel tree classifier utilizing deep and hand-crafted representations
CN107578063B (en) Image spectral clustering based on fast landmark-point selection
CN102609715B (en) Object type identification method combining multiple interest-point detectors
US20240054639A1 (en) Quantification of conditions on biomedical images across staining modalities using a multi-task deep learning framework
Zhang et al. Discriminative tensor sparse coding for image classification.
Rhee Improvement feature vector: Autoregressive model of median filter residual
CN102609733B (en) Fast face recognition method in application environment of massive face database
CN108776809A (en) Dual-sampling ensemble classification model based on Fisher kernels
Sun et al. A compositional feature embedding and similarity metric for ultra-fine-grained visual categorization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant