CN105005783B - Method for extracting classification information from high-dimensional asymmetric data - Google Patents
- Publication number
- CN105005783B (application CN201510251168.4A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- sample
- dimension
- data
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The present invention relates to the field of signal and image processing and provides a method for extracting classification information from high-dimensional asymmetric data. It addresses the shortcomings of existing classification-information extraction methods, which are either unsuited to sample-asymmetric data, computationally expensive, or prone to computation overflow when processing high-dimensional data. The method comprises: obtaining high-dimensional asymmetric data; assigning new weights to Σo and Σc to form a new covariance matrix Σα that replaces Σt in the eigendecomposition, and solving for its eigenvalues and eigenvectors; and combining eigenvectors into a dimension-reduction matrix through which the high-dimensional asymmetric data is projected to obtain the reduced-dimension classification information. The technical solution proposed by the present invention has low computational complexity, high accuracy, fast running speed and good stability.
Description
Technical field
The present invention relates to the field of signal and image processing, and in particular to a method for extracting classification information from high-dimensional asymmetric data.
Background art
Extracting classification information from two-class sample data has great practical value: the extracted classification information can be used to distinguish face from non-face images, diseased from non-diseased samples, or useful information from junk. As the technology and means for acquiring information grow ever more advanced, the two classes of data to be classified become increasingly high-dimensional, and the two sample sets acquired are usually unbalanced in size, so traditional two-class classification methods are severely limited. There is therefore an urgent need for a method that can extract classification information from high-dimensional data whose two classes have asymmetric sample counts, to meet the development needs of every field of the massive-information society.
Principal component analysis (PCA) is one of the most widely used unsupervised multivariate statistical methods. It performs an eigenanalysis of the covariance matrix of the data set and isolates, as classification information, the principal components that minimize reconstruction error. PCA compresses data strongly and is easy to implement. When facing unbalanced samples, however, PCA maximizes the reconstructed information in the principal-component space but cannot effectively retain the information useful for classification, degrading the classification performance of the whole application system. The main factor that prevents PCA from classifying correctly is this: when the sample count of one class of data (called the positive class) is smaller than that of the other class (called the negative class), the eigenvectors corresponding to the small eigenvalues of the positive class-conditional covariance matrix deviate severely. To remedy this defect of PCA, asymmetric PCA (Asymmetric Principal Component Analysis, APCA) was proposed. APCA focuses on eliminating this interfering factor: it assigns new weights to the positive and negative class-conditional covariance matrices, forming a new covariance matrix that replaces PCA's total scatter matrix before the eigendecomposition. Compared with PCA, APCA extracts classification information from unbalanced data far more effectively, but when processing high-dimensional data (such as some medical images) it frequently suffers computation overflow.
The reason is that the new covariance matrix APCA builds is a linear combination of several matrices of size n × n, where n is the original dimension of the data. In many practical applications the original dimension is large: an image of size 200 × 200 already has 40000 pixels, i.e. 40000 dimensions. When computing the eigenvalues of such a high-dimensional covariance matrix, APCA easily exhausts machine memory and the computation cannot continue; even when the computation is feasible, matrices of this huge dimension inevitably bring high computational complexity, and both computation time and error grow substantially.
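The memory claim above is easy to check with back-of-the-envelope arithmetic. A minimal sketch (the 200 × 200 image example is from the text; storing the matrix in 8-byte float64 is our assumption):

```python
# Memory needed by one dense n-by-n covariance matrix in float64.
n = 200 * 200                 # a 200x200 image flattened: 40000 dimensions
bytes_needed = n * n * 8      # 8 bytes per float64 entry
gib = bytes_needed / 2**30    # convert to GiB
print(round(gib, 1))          # about 11.9 GiB for a single n x n matrix
```

APCA forms several such matrices, so even materializing Σα densely is already prohibitive at this scale.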
It can be seen that existing classification-information extraction methods are either unsuited to sample-asymmetric data, or computationally expensive and prone to computation overflow when processing high-dimensional data.
Summary of the invention
[technical problems to be solved]
The purpose of the present invention is to overcome the above drawbacks of the background art. By introducing joint diagonalization theory, a method is devised and embodied for extracting classification information from high-dimensional data with asymmetric two-class sample counts. For convenience of description, the method provided by the invention is named Joint Diagonalization Principal Component Analysis (JDPCA).
[technical solution]
The present invention is achieved by the following technical solutions.
The present invention relates to a method for extracting classification information from high-dimensional asymmetric data, comprising the following steps:
Step A: obtain the high-dimensional asymmetric data, which consists of positive samples and negative samples; determine by analysis its dimension n, its total sample count q, the positive sample count qo and the negative sample count qc; and set the dimension m of the classification information to be extracted.
Step B: compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector Mo of the positive class and the mean vector Mc of the negative class; center the positive and negative samples to obtain the centered positive sample set matrix So and the centered negative sample set matrix Sc.
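The mean vectors and centering of Step B can be sketched as follows; sizes and variable names are illustrative, not taken from the patent's formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
qo, qc, n = 5, 12, 30                    # illustrative sizes (qo != qc)
pos = rng.normal(size=(qo, n))           # positive samples, one per row
neg = rng.normal(size=(qc, n))           # negative samples, one per row

M  = np.vstack([pos, neg]).mean(axis=0)  # mean vector M of all samples
Mo = pos.mean(axis=0)                    # positive-class mean vector Mo
Mc = neg.mean(axis=0)                    # negative-class mean vector Mc
So = pos - Mo                            # centered positive sample matrix So
Sc = neg - Mc                            # centered negative sample matrix Sc
```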
Step C: construct the matrices Xo, Xc, Xmo and Xmc, where αo = qc/q and αc = qo/q.
Step D: compute the nonzero eigenvalues and corresponding eigenvectors of Xo^TXo and Xc^TXc, the eigenvalue λmo and corresponding eigenvector umo of Xmo^TXmo, and the eigenvalue λmc and corresponding eigenvector umc of Xmc^TXmc.
Step E: piece together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors computed in Step D, and construct a new matrix from U and Λ.
Step F: compute the eigenvalues and corresponding eigenvectors of the matrix constructed in Step E; arrange its eigenvalues in descending order to obtain {λ(k)} and the corresponding eigenvectors {u(k)}, where k = 1, 2, ... q; take the eigenvectors corresponding to the first m eigenvalues in {u(k)} and combine them into the dimension-reduction matrix Φm; project the high-dimensional asymmetric data through Φm to obtain the reduced-dimension classification information.
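Steps A through F can be sketched end to end. Note this is not the patent's joint-diagonalization construction: it reaches the same eigen-structure of Σα through one small Gram-matrix eigendecomposition, under assumed scalings Xo = √(αo/qo)·So, Xc = √(αc/qc)·Sc, Xmo = √αc·(Mo − M), Xmc = √αo·(Mc − M), chosen so that Xo^TXo + Xc^TXc + Xmo^TXmo + Xmc^TXmc = Σα; the exact formula bodies of the patent are not reproduced in this text.

```python
import numpy as np

rng = np.random.default_rng(1)
qo, qc, n, m = 6, 14, 40, 3              # illustrative sizes; q << n
q = qo + qc
pos = rng.normal(size=(qo, n))
neg = rng.normal(size=(qc, n))
M, Mo, Mc = np.vstack([pos, neg]).mean(0), pos.mean(0), neg.mean(0)
ao, ac = qc / q, qo / q                  # the swapped JDPCA weights

# Assumed scalings so the stacked matrix reproduces Sigma_alpha.
Xo  = np.sqrt(ao / qo) * (pos - Mo)
Xc  = np.sqrt(ac / qc) * (neg - Mc)
Xmo = np.sqrt(ac) * (Mo - M)[None, :]
Xmc = np.sqrt(ao) * (Mc - M)[None, :]
Xt  = np.vstack([Xo, Xc, Xmo, Xmc])      # (q + 2) x n; Sigma_alpha = Xt.T @ Xt

# Eigendecompose the small Gram matrix instead of the n x n Sigma_alpha.
lam, v = np.linalg.eigh(Xt @ Xt.T)
order = np.argsort(lam)[::-1]            # descending eigenvalues
lam, v = lam[order], v[:, order]
keep = lam > 1e-10                       # nonzero spectrum only
U = Xt.T @ v[:, keep] / np.sqrt(lam[keep])   # eigenvectors of Sigma_alpha
Phi_m = U[:, :m]                         # dimension-reduction matrix
Y = np.vstack([pos, neg]) @ Phi_m        # reduced classification information
```

The only matrices formed are (q + 2) × n and (q + 2) × (q + 2), the same order of saving the patent's joint diagonalization achieves.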
As a preferred embodiment, Step D specifically comprises:
Step D1: compute the nonzero eigenvalues and corresponding eigenvectors of XoXo^T, indexed i = 1, 2, ... qo − 1; compute the nonzero eigenvalues and corresponding eigenvectors of XcXc^T, indexed j = 1, 2, ... qc − 1; compute the eigenvalue λmo of XmoXmo^T and its corresponding eigenvector vmo; and compute the eigenvalue λmc of XmcXmc^T and its corresponding eigenvector vmc.
Step D2: from these, compute the eigenvectors of Xo^TXo and Xc^TXc and the eigenvectors umo and umc.
As another preferred embodiment, the eigenvalues {λ(k)} and corresponding eigenvectors {u(k)} in Step F are computed as follows:
compute the eigenvalues and corresponding eigenvectors of the auxiliary matrix, and arrange its eigenvalues in descending order to obtain {λ(k)} and the corresponding eigenvectors {v(k)}, where k = 1, 2, ... q;
then compute the eigenvectors {u(k)} corresponding to the eigenvalues {λ(k)} from {v(k)}.
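The recovery of {u(k)} from {v(k)} presumably relies on the standard singular-value relation u = X^T v / √λ between the eigenvectors of X^TX and XX^T (the patent's own formula body is not reproduced in this text). A sketch of that relation with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 50))               # few samples, many dimensions

lam, v = np.linalg.eigh(X @ X.T)           # small 7x7 eigenproblem
keep = lam > 1e-10                         # nonzero eigenvalues only
u = X.T @ v[:, keep] / np.sqrt(lam[keep])  # unit eigenvectors of X.T @ X
```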
As another preferred embodiment, the centered positive sample set matrix So is formed from the centered positive-class samples, i = 1, 2, ... qo, and the centered negative sample set matrix Sc is formed from the centered negative-class samples, j = 1, 2, ... qc.
As another preferred embodiment, the mean vector of the high-dimensional asymmetric sample data, the mean vector of the positive class and the mean vector of the negative class are computed from the positive-class samples, i = 1, 2, ... qo, and the negative-class samples, j = 1, 2, ... qc.
As another preferred embodiment, the dimension n of the high-dimensional asymmetric sample data and its total sample count q satisfy n ≥ 3q.
As another preferred embodiment, the high-dimensional asymmetric sample data is image data, gene expression data, or genome-wide association study data.
As another preferred embodiment, each data element in the high-dimensional asymmetric data is a real number.
The technical solution of the present invention is described in detail below.
The present invention targets high-dimensional asymmetric data composed of positive and negative samples. Specifically, by analyzing the obtained high-dimensional asymmetric data we get its dimension n, its total sample count q, the positive sample count qo and the negative sample count qc, with q = qo + qc. Because the data is asymmetric, qo ≠ qc; because it is high-dimensional, n >> q, where ">>" denotes "much greater than". Generally, the dimension n of the high-dimensional asymmetric data should be at least three times its total sample count q, i.e. n ≥ 3q.
Specifically, the positive sample set in the high-dimensional asymmetric data is composed of qo samples, each represented as a row vector, where the subscript o denotes the positive class and i = 1, 2, ... qo indexes the i-th positive sample; its mean vector Mo and its class-conditional covariance matrix Σo are computed from these rows. The negative sample set is composed of qc row-vector samples, where the subscript c denotes the negative class and j = 1, 2, ... qc indexes the j-th negative sample; its mean vector Mc and class-conditional covariance matrix Σc are solved in the same way.
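Under the usual 1/qo normalization (an assumption; the patent's formula bodies for Mo and Σo are not reproduced in this text), the class-conditional covariance matrix can be computed as:

```python
import numpy as np

rng = np.random.default_rng(3)
qo, n = 8, 20
pos = rng.normal(size=(qo, n))     # positive samples, one per row
Mo = pos.mean(axis=0)              # class mean vector Mo
So = pos - Mo                      # centered positive samples
Sigma_o = So.T @ So / qo           # class-conditional covariance (1/qo assumed)
```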
The high-dimensional asymmetric data is the union of the above two classes of samples. Its mean vector M is solved in the same way, and the centered high-dimensional asymmetric data X is obtained by subtracting the mean vector M.
PCA in the prior art uses the smaller matrix XX^T to eigendecompose the total scatter matrix Σt indirectly; the eigenvectors corresponding to the m largest eigenvalues form the dimension-reduction matrix, and any n-dimensional datum projected through this matrix is reduced to m dimensions. The total scatter matrix Σt is given by formula (3) and the between-class scatter matrix Σm by formula (4).
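Assuming the standard definitions (the bodies of formulas (3) and (4) are not reproduced in this text), the total scatter matrix decomposes as Σt = (qo/q)Σo + (qc/q)Σc + Σm, the identity whose weights the re-weighting of formula (5) perturbs. A numerical check:

```python
import numpy as np

rng = np.random.default_rng(4)
qo, qc, n = 6, 10, 12
q = qo + qc
pos, neg = rng.normal(size=(qo, n)), rng.normal(size=(qc, n))
X = np.vstack([pos, neg])
M, Mo, Mc = X.mean(0), pos.mean(0), neg.mean(0)

Sigma_t = (X - M).T @ (X - M) / q            # total scatter, formula (3)
Sigma_o = (pos - Mo).T @ (pos - Mo) / qo     # positive class covariance
Sigma_c = (neg - Mc).T @ (neg - Mc) / qc     # negative class covariance
Sigma_m = (qo / q) * np.outer(Mo - M, Mo - M) \
        + (qc / q) * np.outer(Mc - M, Mc - M)  # between-class scatter, formula (4)
```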
The JDPCA method provided by the invention assigns new weights to Σo and Σc, forming a new covariance matrix Σα that replaces Σt in the eigendecomposition, and solves for its eigenvalues and eigenvectors. The covariance matrix Σα is given by formula (5):
Σα = αoΣo + αcΣc + Σm (5)
Because the weights of the two class-conditional covariance matrices in Σα become αo = qc/q and αc = qo/q, they are no longer estimates of the prior probabilities of the two classes, so the matrix satisfying the required equation cannot be obtained directly from the centered high-dimensional asymmetric data X as in PCA. To solve for the required eigen-structure, the present invention finds a matrix U that simultaneously diagonalizes all the matrices composing Σα. After the diagonalization is achieved, a new matrix is constructed from U and the resulting diagonal matrices. Since the entire computation never generates a matrix as large as n × n, the computational complexity of JDPCA is substantially reduced. However, existing joint diagonalization methods are approximate algorithms that usually require iteration or inversion; if JDPCA adopted them directly, the information PCA extracts would be distorted and the computational load would grow. For this reason, the present invention cleverly exploits the low rank and real symmetry of the above covariance matrices to devise a new, fast and exact non-orthogonal joint diagonalization algorithm for finding the required matrix, so that JDPCA avoids the curse of dimensionality when processing high-dimensional data.
The conventional joint diagonalization problem can be stated in the following form: given L matrices A1, A2, ... AL of size n × n, find one diagonalizing matrix U and L corresponding diagonal matrices Λ1, Λ2, ... ΛL such that Al = UΛlU^H holds for every l ∈ {1, 2, 3 ... L}. Since the matrices to be jointly diagonalized in the present invention are real symmetric, the conjugate transpose "H" is written throughout as the transpose "T". Following formulas (3), (4) and (5), absorb the weights into the class terms by denoting αoΣo and αcΣc again as Σo and Σc (re-using the symbols), and let Σmo = αc(Mo − M)(Mo − M)^T and Σmc = αo(Mc − M)(Mc − M)^T; then Σα can be expressed as:
Σα = Σo + Σc + Σmo + Σmc (6)
It is an object of the present invention to find one matrix U and four diagonal matrices Λo, Λc, Λmo, Λmc such that U simultaneously diagonalizes the four square matrices Σo, Σc, Σmo, Σmc into the corresponding diagonal matrices; Σα can then be decomposed into the following form:
Σα = UΛoU^T + UΛcU^T + UΛmoU^T + UΛmcU^T = U(Λo + Λc + Λmo + Λmc)U^T (7)
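The second equality in formula (7) is pure linearity: if one matrix U diagonalizes every term, their sum is diagonalized by the same U. A minimal numerical check with a synthetic shared diagonalizer (orthogonal here for simplicity, whereas the patent's algorithm is non-orthogonal):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # shared diagonalizer
Lams = [np.diag(rng.normal(size=n)) for _ in range(4)]
mats = [Q @ L @ Q.T for L in Lams]              # four jointly diagonalizable matrices

lhs = sum(mats)                                 # Sigma_alpha-style sum
rhs = Q @ sum(Lams) @ Q.T                       # U (Lam_o + ... + Lam_mc) U^T
```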
The matrix assembled in formula (7) is exactly the matrix the present invention seeks. The four diagonal matrices can be built from the eigenvalues of the corresponding original square matrices; the difficulty lies in computing the diagonalizing matrix U. The matrices Σo, Σc, Σmo, Σmc are all real symmetric of size n × n, and when q << n their ranks are respectively qo − 1, qc − 1, 1 and 1, far smaller than the dimension n. For a general real symmetric matrix Al with diagonalization Al = UΛlU^T, this factorization has the following properties:
(a) Changing, in the diagonalizing matrix U, the eigenvectors that correspond to zero eigenvalues of Λl leaves the equation valid.
(b) Simultaneously exchanging a pair of eigenvalues in Λl and their corresponding eigenvectors in U leaves the equation valid.
(c) Deleting zero eigenvalues from Λl together with the contents of the corresponding columns of U leaves the equation valid.
(d) Appending artificial zero eigenvalues after the original eigenvalues of Λl, with zero vectors added at the corresponding positions of U, leaves the equation valid.
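Property (c), for example, is easy to verify numerically: for a low-rank real symmetric matrix, deleting the zero eigenvalues and the corresponding eigenvector columns leaves the factorization intact, which is what lets JDPCA work with small matrices only:

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 8, 3
B = rng.normal(size=(n, r))
A = B @ B.T                            # rank-r real symmetric matrix
lam, U = np.linalg.eigh(A)
keep = lam > 1e-10                     # drop (numerically) zero eigenvalues
A_rebuilt = U[:, keep] @ np.diag(lam[keep]) @ U[:, keep].T   # property (c)
```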
Using the above properties together with the singular value decomposition theorem, all eigenvalues and eigenvectors of Σo, Σc, Σmo, Σmc can be found. Exploiting their low rank, the positions of eigenvalues and eigenvectors are then rearranged: the eigenvectors corresponding to some of the zero eigenvalues of one matrix are replaced by the eigenvectors corresponding to the nonzero eigenvalues of the other three matrices, and a matrix U satisfying the condition is finally pieced together. After the new matrix is constructed from U and the diagonal matrices, the dimension-reduction matrix Φm is built from it, and the high-dimensional asymmetric data is projected through Φm to obtain the reduced-dimension classification information.
[beneficial effect]
Compared with the prior art, the technical solution proposed by the present invention has the following advantages:
(1) When reducing the dimension of two classes whose sample counts are unbalanced, the redundant information of the minority class is usually less stable than that of the majority class; if this redundancy is not rejected strongly enough, severe over-fitting occurs at classification time. Because the majority class has more training samples, its redundant information is comparatively reliable and stable, and part of it carries credible information about the difference between the two classes. The present invention therefore increases the strength with which the unstable redundancy of the minority class is rejected, and decreases the strength with which the redundancy of the majority class is rejected. The principal components the invention retains are those that best embody the difference between the two classes. Hence, for two-class sample data of unbalanced quantity, each principal component extracted by the invention distinguishes the two classes more clearly than those of traditional PCA, while the components remain orthogonal and mutually uncorrelated.
(2) Since the entire computation of the invention neither generates nor operates on any matrix as large as n × n (where n is the original dimension of the data), the computational complexity of the invention is greatly reduced. When processing high-dimensional two-class sample data, the invention computes accurately, runs fast and is stable.
Brief description of the drawings
Fig. 1 is a flow chart of the method for extracting classification information from high-dimensional asymmetric data provided by an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the specific embodiments of the invention are described clearly and completely below with reference to the accompanying drawing. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them, and do not limit the invention. All other embodiments obtained by those of ordinary skill in the art without creative work, based on the embodiments of the invention, fall within the scope of protection of the invention.
Fig. 1 is a flow chart of the method for extracting classification information from high-dimensional asymmetric data provided by Embodiment 1 of the present invention. As shown in Fig. 1, the method comprises steps S11 to S19, which are described in detail below.
Step S11: higher-dimension asymmetric data is obtained.
Specifically, the high-dimensional asymmetric data is composed of positive and negative samples. Analysis yields the dimension n of the data, the total sample count q, the positive sample count qo and the negative sample count qc, and the dimension m of the classification information to be extracted is set. In this embodiment the positive-class samples are indexed i = 1, 2, ... qo and the negative-class samples j = 1, 2, ... qc.
Step S12: compute the mean vectors, and compute the centered positive and negative sample set matrices.
Specifically, compute the mean vector M of the high-dimensional asymmetric sample data, the mean vector Mo of the positive class and the mean vector Mc of the negative class; center the positive and negative samples to obtain the centered positive sample set matrix So and the centered negative sample set matrix Sc. The computation of each mean vector is described in the summary of the invention.
Step S13: construct the matrices Xo, Xc, Xmo, Xmc.
Specifically, with αo = qc/q and αc = qo/q, the matrix Xo has size qo × n and clearly satisfies Xo^TXo = Σo; likewise Xc satisfies Xc^TXc = Σc. The matrix Xmo has size 1 × n and clearly satisfies Xmo^TXmo = Σmo; likewise Xmc satisfies Xmc^TXmc = Σmc.
Step S14: compute the nonzero eigenvalues and corresponding eigenvectors of XoXo^T, XcXc^T, XmoXmo^T and XmcXmc^T.
Specifically, XoXo^T has size qo × qo; its nonzero eigenvalues and the corresponding eigenvectors are computed, indexed i = 1, 2, ... qo − 1. XcXc^T has size qc × qc; its nonzero eigenvalues and corresponding eigenvectors are computed, indexed j = 1, 2, ... qc − 1. XmoXmo^T has size 1 × 1, so its eigenvalue is the entry itself, denoted λmo, with corresponding eigenvector vmo; likewise XmcXmc^T is 1 × 1 with eigenvalue λmc and corresponding eigenvector vmc.
Step S15: solve for the nonzero eigenvalues and corresponding eigenvectors of Xo^TXo, Xc^TXc, Xmo^TXmo and Xmc^TXmc.
Step S15 uses the singular value decomposition theorem: Xo^TXo and XoXo^T share the same nonzero eigenvalues, and their eigenvectors stand in a fixed correspondence, from which the eigenvectors of Xo^TXo are solved; the nonzero eigenvalues and corresponding eigenvectors of Xc^TXc, Xmo^TXmo and Xmc^TXmc are obtained in the same way. Specifically, the eigenvectors, including umo and umc, are computed from this correspondence.
Step S16: piece together the diagonalizing matrix U and the diagonal matrix Λ.
In step S16, four diagonalizing matrices of size n × q and four diagonal matrices of size q × q are constructed first. By properties (b), (c) and (d) of the diagonalization algorithm it is easy to verify that they satisfy Σmo = UmoΛmoUmo^T and Σmc = UmcΛmcUmc^T. These blocks are then combined into the diagonalizing matrix U and the diagonal matrix Λ. By property (a) of the diagonalization algorithm it can be verified that U jointly diagonalizes the four square matrices Σo, Σc, Σmo, Σmc composing Σα, with diagonalization results Λo, Λc, Λmo and Λmc.
Step S17: construct the new matrix and compute its eigenvalues and corresponding eigenvectors.
The new matrix is constructed from U and Λ; by formula (7), the required decomposition holds. Its eigenvalues {λ(k)} and corresponding eigenvectors {v(k)} are computed, with the eigenvalues {λ(k)} arranged in descending order, k = 1, 2, ... q.
Step S18: compute the eigenvalues and corresponding eigenvectors of Σα.
Specifically, by the singular value decomposition theorem, the eigenvector {u(k)} corresponding to each eigenvalue {λ(k)} is computed from {v(k)}.
Step S19: combine the dimension-reduction matrix and project the high-dimensional asymmetric data through it to obtain the reduced-dimension classification information.
Specifically, the eigenvectors corresponding to the first m eigenvalues in {u(k)} are combined into the dimension-reduction matrix Φm, and the high-dimensional asymmetric data obtained in step S11 is projected through Φm to obtain the reduced-dimension classification information.
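The projection in step S19 is a single matrix product. A sketch with a stand-in orthonormal eigenvector matrix (illustrative sizes; not the U the patent constructs):

```python
import numpy as np

rng = np.random.default_rng(7)
q, n, m = 10, 25, 4
U = np.linalg.qr(rng.normal(size=(n, n)))[0]   # stand-in eigenvector matrix
Phi_m = U[:, :m]                               # first m eigenvectors form Phi_m
data = rng.normal(size=(q, n))                 # high-dimensional samples
Y = data @ Phi_m                               # q x m classification information
```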
Classification-information extraction experiments on two-class sample data are carried out below with the method provided by the embodiment of the present invention. To verify and compare the performance of the embodiment from the two aspects of different dimension scales and different two-class sample proportions, two groups of data are used, referred to as group A and group B. Group A verifies the accuracy and running speed of the embodiment on data of different dimensions; group B verifies its class-discrimination ability on unbalanced samples. Each group contains positive and negative samples; the experimental data are described as follows.
Group A data: generated to verify the embodiment's ability to extract classification information from data of different dimensions. In group A, the positive and negative sample counts are each set to 500, so the total sample count is 1000. The mean of every dimension of the positive samples is 0, and the variance of the i-th dimension is 1/i^0.5. The means of the negative samples are nonzero and differ across dimensions: the mean of the j-th dimension is 1/(8j)^0.25 and its variance is 1/(50j)^0.25. The two classes are designed this way for the following reasons: (1) both the means and the variances of the two classes differ, so the differences are comprehensive; (2) the dimensions with large mean differences also have large variance differences (concentrated mainly in the first 20 dimensions) and the dimensions with small mean differences have small variance differences, so neither the mean differences nor the variance differences show an overly sharp separation trend over the whole dimension range; every dimension thus contributes to correct classification, and classification accuracy can also rise as the total dimension grows; (3) on any single dimension neither the mean difference nor the variance difference between the two classes is large, so relying on one or a few dimensions cannot separate the classes; every method thus achieves a certain accuracy in distinguishing the two classes but cannot easily reach 100%. The total dimension n of group A rises from n = 1500 to n = 10000 in steps of 500, yielding multiple data sets with identical properties but different dimensions.
Group B data: face and non-face image data downloaded from the MIT face image database, of which 1000 images are selected for the experiment. In the experiment, the total sample count is fixed at 1000 while the proportion of face (positive) training samples is varied from 50% down to 5%, i.e. gradually reduced from 450 to 45. The two class sample counts thus change from a balanced to an unbalanced state, yielding multiple data sets with identical dimensions but different class sample counts.
After extracting classification information from the above data with the embodiment of the present invention, an improved support vector machine (ODR-BSMOTE-SVM, abbreviated OB-SVM) classifies the samples according to the extracted classification information. In the OB-SVM classifier the kernel function is fixed to a Gaussian kernel, and the balance parameter of ODR and BSMOTE is taken as α = 0.9. The verification method of all experiments is ten-fold cross-validation, and the classification effect is assessed by average sensitivity (Sen), specificity (Spe) and accuracy (Acc). Let FP be the number of negative samples misclassified as positive and FN the number of positive samples misclassified as negative; TP and TN denote the numbers of correctly classified positive and negative samples respectively. Sensitivity, specificity and accuracy are then defined as follows:
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Accuracy = (TP + TN)/(TP + TN + FP + FN)
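The three definitions can be wrapped in a small helper (the function name is ours, not from the text):

```python
def sen_spe_acc(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sen = tp / (tp + fn)                    # true-positive rate
    spe = tn / (fp + tn)                    # true-negative rate
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    return sen, spe, acc
```

For example, 40 true positives, 50 true negatives, 5 false positives and 5 false negatives give an accuracy of 0.9.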
Experiment 1: performance and running-speed verification under different dimensions.
Experiment 1 uses group A data. Because of the ten-fold validation, the total number of samples participating in training each time is q = 900 (450 each of positive and negative samples), with 100 test samples (50 of each class). The total dimension rises from n = 1500 to n = 10000 in steps of 500. On the group A data of each dimension, the JDPCA, APCA and PCA methods are executed in turn, followed by OB-SVM classification. The reduction parameter m is fixed at 50; it should be appreciated that m is simply the dimension of the classification information to be extracted. Table 1 shows the average classification performance and computation time obtained by ten-fold cross-validation at each dimension. Since qo = qc = 450, the two class sample counts are balanced, so the eigendecomposition of the covariance matrix Σα computed by JDPCA and APCA does not differ from that of the total scatter matrix Σt computed by PCA, and the average sensitivity, specificity, accuracy and AUC of the three methods are identical; Table 1 therefore lists these performance figures only once. The last three columns of Table 1 give the average training time each of the three methods needs on group A data of the various dimensions; OM there denotes a memory overflow (Out of Memory), i.e. the run cannot continue and the method is forcibly interrupted.
Table 1: classification performance and computation time based on the classification information of JDPCA, APCA and PCA on data of different dimensions
As Table 1 shows, as the total dimension n rises from 1500 to 10000, the accuracy of the three methods remains identical at each dimension, climbing slowly from 91.7% to 100% as the dimension increases. Their running times, however, differ greatly. The running times of PCA and JDPCA grow linearly: for every increase of 500 in the dimension n, the running time of PCA grows by about 1.2 s on average and that of JDPCA by about 5 s. The running time of APCA grows quadratically, and memory overflow already occurs when n reaches 9500. Because the present embodiment involves more intermediate variables and the corresponding arithmetic operations, JDPCA shows no computation-time advantage over APCA at low dimensions, but once the dimension exceeds 2500 its speed advantage becomes evident. When facing high-dimensional data, the curse of dimensionality becomes severe for APCA: its computational complexity rises steeply, and memory overflow makes the computation impossible once the data dimension reaches 9500. The JDPCA proposed by the present invention has no such problem, because the joint-diagonalization algorithm designed in the embodiment avoids generating and computing with large n × n matrices (where n is the original data dimension) and involves no complex operations such as matrix inversion or iteration, so the eigenvalues and eigenvectors of Σα are computed faster and more accurately. For larger data the computational complexity of APCA is greater still, making its computation excessively slow or even infeasible, which cannot meet the current need in many fields to process big data. Although the present invention is not as fast as PCA, its running time is greatly reduced compared with APCA.
Experiment two: class-discrimination verification on unbalanced data.
Experiment two uses the group B data. To eliminate the influence of the differing computational complexity at different dimensions on this experiment, all images are standardized to 45 × 45 (n = 2025). The total number of training samples is q = 900. The proportion (P) of face-image samples is reduced from 50% to 5%, i.e., gradually from 450 samples to 45. The dimension-reduction parameter m is fixed at 50. The ten-fold cross-validation results are shown in Table 2.
Table 2: Classification performance of classification information based on JDPCA, APCA, and PCA at different face-image proportions
As Table 2 shows, on the unbalanced MIT face-database data the three methods JDPCA, APCA, and PCA differ little in specificity and accuracy; in particular, JDPCA and APCA show no difference in the final results because their calculated results are theoretically identical. The difference among the three lies mainly in sensitivity. Clearly, as the face-sample proportion gradually decreases, the sensitivity of PCA declines substantially, and its drop is larger than that of JDPCA and APCA. When face samples account for only 5% of the total, the sensitivity of PCA is up to 6% lower than that of JDPCA and APCA. It can be seen that the JDPCA provided by the embodiment of the present invention retains APCA's class-discrimination advantage on unbalanced data sets.
Therefore, when extracting classification information from high-dimensional data with unbalanced sample sizes, the JDPCA proposed by the present invention combines fast computation with high accuracy, and has broad practical value in fields such as communications, radar, and biomedical signal/image processing.
As the above embodiments and experiments show, the embodiment of the present invention increases the strength with which the unstable redundancy of minority-class samples is rejected and decreases the strength with which the redundancy of majority-class samples is rejected; moreover, the principal components retained by the embodiment are those that best reflect the difference between the two classes. Therefore, for two-class sample data with unbalanced quantities, each principal component extracted by the embodiment distinguishes the two classes more clearly than traditional PCA, while the components remain orthogonal and mutually uncorrelated. In addition, since the entire computation of the embodiment generates and operates on no matrix larger than n × n (where n is the original data dimension), the computational complexity of the embodiment is greatly reduced. When processing high-dimensional two-class sample data, the embodiment computes with high accuracy, fast running speed, and good stability, whereas the traditional APCA method has larger computation error and is prone to computation overflow.
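The reweighting described above can be sketched at toy scale (a conceptual simplification under stated assumptions: here Σα is taken as the weighted sum αo·Σo + αc·Σc with the swapped weights αo = qc/q and αc = qo/q from Step C; the patent's full construction also involves the mean-related matrices Xmo and Xmc, and its joint-diagonalization algorithm never forms Σα at full dimension):

```python
import numpy as np

rng = np.random.default_rng(1)
n, q_o, q_c, m = 20, 90, 10, 3          # toy sizes; the negative class is the minority
q = q_o + q_c
pos = rng.standard_normal((q_o, n))      # positive-class samples (rows)
neg = rng.standard_normal((q_c, n)) + 2.0

# Per-class covariance matrices of the (internally centered) samples.
sigma_o = np.cov(pos, rowvar=False)
sigma_c = np.cov(neg, rowvar=False)

# Swapped weights from Step C: the minority class receives the larger weight.
alpha_o, alpha_c = q_c / q, q_o / q      # 0.1 and 0.9 here
sigma_alpha = alpha_o * sigma_o + alpha_c * sigma_c

# Top-m eigenvectors of sigma_alpha form the dimension-reduction matrix Phi_m.
vals, vecs = np.linalg.eigh(sigma_alpha)
phi_m = vecs[:, -m:]                     # n x m projection matrix
reduced = np.vstack([pos, neg]) @ phi_m  # reduced-dimension classification information
print(reduced.shape)                     # (100, 3)
```

The swapped weights are what distinguish Σα from the ordinary total scatter matrix Σt: when qo = qc the two coincide (as observed in experiment one), and the more unbalanced the classes, the more the minority-class covariance dominates Σα.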
Claims (8)
1. An image classification method, applied to the case of unbalanced samples, characterized by comprising the following steps:
inputting an image to be classified, the image to be classified comprising unbalanced positive samples and negative samples, and extracting classification information from the positive samples and negative samples according to a joint-diagonalization principal component analysis method, comprising:
Step A: obtaining the high-dimensional asymmetric data composed of the positive samples and negative samples; obtaining by analysis the dimension n of the high-dimensional asymmetric data, the total sample quantity q of the high-dimensional asymmetric data, the sample quantity qo of the positive samples, and the sample quantity qc of the negative samples; and setting the dimension m of the classification information to be extracted;
Step B: calculating the mean vector M of the high-dimensional asymmetric sample data, the mean vector Mo of the positive-class samples, and the mean vector Mc of the negative-class samples; and centering the positive samples and negative samples respectively to obtain the centered positive-sample set matrix So and the centered negative-sample set matrix Sc;
Step C: constructing matrix Xo, matrix Xc, matrix Xmo, and matrix Xmc respectively, where αo = qc/q and αc = qo/q;
Step D: calculating the nonzero eigenvalues and corresponding eigenvectors of matrix Xo^T Xo, the nonzero eigenvalues and corresponding eigenvectors of matrix Xc^T Xc, the eigenvalue λmo and corresponding eigenvector umo of matrix Xmo^T Xmo, and the eigenvalue λmc and corresponding eigenvector umc of matrix Xmc^T Xmc;
Step E: piecing together the diagonalizing matrix U and the diagonal matrix Λ from the eigenvectors obtained in Step D, and constructing a matrix therefrom;
Step F: calculating the eigenvalues and corresponding eigenvectors of the constructed matrix; arranging its eigenvalues from large to small to obtain {λ(k)} and the corresponding eigenvectors {u(k)}, where k = 1, 2, ... q; taking the eigenvectors in {u(k)} corresponding to the first m eigenvalues and combining them into the dimension-reduction matrix Φm; and projecting the high-dimensional asymmetric data through the dimension-reduction matrix Φm to obtain the reduced-dimension classification information;
and classifying the image to be classified according to the classification information;
wherein the Step D specifically comprises:
Step D1: calculating the nonzero eigenvalues and corresponding eigenvectors of matrix Xo Xo^T, where i = 1, 2, ... qo − 1; calculating the nonzero eigenvalues and corresponding eigenvectors of matrix Xc Xc^T, where j = 1, 2, ... qc − 1; calculating the eigenvalue λmo and corresponding eigenvector vmo of matrix Xmo Xmo^T; and calculating the eigenvalue λmc and corresponding eigenvector vmc of matrix Xmc Xmc^T;
Step D2: calculating from these, respectively, the eigenvectors of Xo^T Xo and Xc^T Xc as well as the eigenvectors umo and umc.
2. The image classification method according to claim 1, characterized in that:
the image to be classified is a face image or a non-face image, or a disease sample or a non-disease sample.
3. The image classification method according to claim 1, characterized in that the method of calculating the eigenvalues {λ(k)} and corresponding eigenvectors {u(k)} of the matrix in Step F is:
calculating the eigenvalues and corresponding eigenvectors of the matrix, and arranging its eigenvalues from large to small to obtain {λ(k)} and the corresponding eigenvectors {v(k)}, where k = 1, 2, ... q;
and then calculating from {v(k)} the eigenvectors {u(k)} corresponding to the eigenvalues {λ(k)} of the matrix.
4. The image classification method according to claim 1, 2, or 3, characterized in that: the centered positive-sample set matrix So is formed from the centered positive-class samples, where i = 1, 2, ... qo, and the centered negative-sample set matrix Sc is formed from the centered negative-class samples, where j = 1, 2, ... qc.
5. The image classification method according to claim 1, 2, or 3, characterized in that: the mean vector of the high-dimensional asymmetric sample data is M, the mean vector of the positive-class samples is Mo, and the mean vector of the negative-class samples is Mc, computed over the positive-class samples, where i = 1, 2, ... qo, and the negative-class samples, where j = 1, 2, ... qc.
6. The image classification method according to claim 1, 2, or 3, characterized in that the dimension n of the high-dimensional asymmetric sample data and the total sample quantity q of the high-dimensional asymmetric sample data satisfy: n ≥ 3q.
7. The image classification method according to claim 6, characterized in that the high-dimensional asymmetric sample data are image data, gene expression data, or genome-wide association study data.
8. The image classification method according to claim 1, 2, or 3, characterized in that each data element in the high-dimensional asymmetric data is a real number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510251168.4A CN105005783B (en) | 2015-05-18 | 2015-05-18 | The method of classification information is extracted from higher-dimension asymmetric data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105005783A CN105005783A (en) | 2015-10-28 |
CN105005783B true CN105005783B (en) | 2019-04-23 |
Family
ID=54378448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510251168.4A Active CN105005783B (en) | 2015-05-18 | 2015-05-18 | The method of classification information is extracted from higher-dimension asymmetric data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105005783B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106980773B (en) * | 2017-05-27 | 2023-06-06 | 重庆大学 | Gas-liquid diphase detection system data fusion method based on artificial smell-taste technology |
CN107392259B (en) * | 2017-08-16 | 2021-12-07 | 北京京东尚科信息技术有限公司 | Method and device for constructing unbalanced sample classification model |
CN108106500B (en) * | 2017-12-21 | 2020-01-14 | 中国舰船研究设计中心 | Missile target type identification method based on multiple sensors |
CN112685509B (en) * | 2020-12-29 | 2022-08-02 | 通联数据股份公司 | High-dimensional data collaborative change amplitude identification method and device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156885B (en) * | 2010-02-12 | 2014-03-26 | 中国科学院自动化研究所 | Image classification method based on cascaded codebook generation |
CN103035050B (en) * | 2012-12-19 | 2015-05-20 | 南京师范大学 | High-precision face recognition method for complex face recognition access control system |
CN103218625A (en) * | 2013-05-10 | 2013-07-24 | 陆嘉恒 | Automatic remote sensing image interpretation method based on cost-sensitive support vector machine |
CN103679132B (en) * | 2013-07-15 | 2016-08-24 | 北京工业大学 | A kind of nude picture detection method and system |
CN103531205B (en) * | 2013-10-09 | 2016-08-31 | 常州工学院 | The asymmetrical voice conversion method mapped based on deep neural network feature |
CN104616013A (en) * | 2014-04-30 | 2015-05-13 | 北京大学 | Method for acquiring low-dimensional local characteristics descriptor |
CN103927530B (en) * | 2014-05-05 | 2017-06-16 | 苏州大学 | The preparation method and application process, system of a kind of final classification device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||