CN110364264A - Medical data collection feature dimension reduction method based on sub-space learning - Google Patents
- Publication number: CN110364264A (application CN201910546805.9A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
- Classification: G16H50/70, ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention discloses a feature dimension reduction method for medical data sets based on subspace learning, comprising the following steps: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed; construct the optimization objective function and solve its Lagrangian function; compute the global discriminant information and the local discriminant information from the original high-dimensional data matrix and the label column; iteratively solve for the transformation matrix Q until the objective function converges or the maximum number of iterations is reached, obtaining the dimension-reduced data matrix; train a model with the obtained transformation matrix and evaluate the reduced matrix and the classification accuracy by the AUC value. Compared with existing feature dimension reduction methods for medical data sets, the method of the invention performs dimension reduction using both the local and the global discriminant information of the data; it is applicable not only to feature dimension reduction at ordinary scales, but also retains high classification accuracy when the feature dimension of the data is much larger than the sample size.
Description
Technical field
The invention belongs to the field of big data technology and machine learning, and in particular relates to a feature dimension reduction method for medical data sets based on subspace learning.
Background art
Feature dimension reduction (dimensionality reduction) aims to convert high-dimensional data into low-dimensional data. The technique arose because machine learning problems in practical application scenarios produce large amounts of complex high-dimensional data. The running time of most data analysis tasks grows at least linearly with the data dimension, and storing and analyzing high-dimensional data consumes large amounts of memory and computation time. Moreover, many data mining and machine learning tasks, such as classification, clustering and regression, are only effective in low-dimensional spaces and become extremely difficult in high-dimensional spaces. How to reduce the dimension of high-dimensional data while keeping the important information intact is therefore a pressing problem.
Broadly, according to whether the label information of the data is provided, feature dimension reduction methods can be roughly divided into supervised, semi-supervised and unsupervised methods. Subspace learning is a class of linear feature dimension reduction methods, which assumes that the "intrinsic dimensionality" of the data can be represented by a linear transformation of the feature vectors. Typical methods of this kind include principal component analysis (PCA), linear discriminant analysis (LDA) and locality preserving projection (LPP).
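As a concrete illustration of the linear mapping y_i = A^T x_i that such subspace methods share, the sketch below (not from the patent; PCA chosen as the simplest example, implemented with plain NumPy) projects samples onto a learned low-dimensional subspace:

```python
import numpy as np

def pca_transform(X, d):
    """Project rows of X (samples x features) onto the top-d principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    # right singular vectors of the centered data = eigenvectors of the covariance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:d].T                               # transformation matrix A (features x d)
    return Xc @ A, A                           # low-dimensional data y_i = A^T x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                  # toy data: 100 samples, 8 features
Y, A = pca_transform(X, d=2)
print(Y.shape)                                 # (100, 2)
```

The columns of A are orthonormal, so the projection preserves as much variance as any rank-2 linear map can.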
However, the existing methods based on subspace learning all have drawbacks. PCA linearly represents the original high-dimensional data as low-dimensional data by maximizing the covariance of the data, while LDA usually takes the form of a trace ratio and solves for the feature representation by simultaneously maximizing the between-class scatter matrix and minimizing the within-class scatter matrix. PCA and LDA perform eigenvalue decomposition based on spectral methods and consider only global discriminant information (such as the variance in PCA and the class means in LDA), ignoring the discriminant information provided by the neighborhood of each sample point. When the sample size is much smaller than the feature dimension, LDA may produce a singular matrix, which makes the computed eigenvectors inaccurate. LPP is the opposite: it constructs an adjacency graph of the sample points, then computes the weights and preserves the linear structure of each point's neighborhood, but it does not take the importance of global discriminant information into account, so its classification effect is poor. In addition, some existing feature dimension reduction methods, such as PCA and LDA, are parameter-free and assume that the sample points follow a Gaussian distribution, which makes them very sensitive to outliers.
At present, the problem of feature dimension reduction of high-dimensional data is modeled as an optimization problem whose solution often involves eigenvalue decomposition, but it has been pointed out in the literature that the optimal solutions of certain problems cannot be obtained by eigenvalue decomposition. The alternating direction method of multipliers (ADMM) is suitable for solving convex optimization problems, is computationally efficient and converges quickly, and is a focus of current research in the field.
Summary of the invention
The purpose of the present invention is to provide a feature dimension reduction method that is simple and efficient, converges quickly and achieves high classification accuracy.
The technical solution for achieving the aim of the invention is a medical data set feature dimension reduction method based on subspace learning, comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, thereby obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, thereby obtaining the local discriminant information;
Step 5: combining the S_b, S_w, G_w and G_b obtained above, iterate on the Lagrangian function to solve for the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimension-reduced data matrix;
Step 6: train a classifier using the transformation matrix Q, and then evaluate the matrix Q^T X by the AUC value of the classifier.
Compared with the prior art, the remarkable advantages of the present invention are: 1) it performs dimension reduction using global discriminant information and local discriminant information jointly, and can still achieve high classification accuracy when the feature dimension of the samples is much larger than the number of samples; 2) it uses an l2,1-norm regularization term to make the features sparse, which facilitates feature selection; the trained model is robust and not easily disturbed by outliers; 3) unlike many current parameter-free methods, the parameters of the proposed feature dimension reduction method are adjustable, so that the trained model can adapt to specific tasks, and experiments show that a suitable parameter selection method is simple to implement; 4) the proposed solving method for the optimization problem is very efficient, and experiments show that the model converges quickly; 5) the classification accuracy is high.
The present invention is described in further detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flow chart of the medical data set feature dimension reduction method based on subspace learning according to the present invention.
Fig. 2 is a convergence curve in an embodiment of the present invention.
Fig. 3 is a parameter selection diagram in an embodiment of the present invention.
Specific embodiment
With reference to Fig. 1, the medical data set feature dimension reduction method based on subspace learning of the present invention comprises the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, thereby obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, thereby obtaining the local discriminant information;
Step 5: combining the S_b, S_w, G_w and G_b obtained above, iterate on the Lagrangian function to solve for the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimension-reduced data matrix;
Step 6: train a classifier using the transformation matrix Q, and then evaluate the matrix Q^T X by the AUC value of the classifier.
Further, in step 1, the original high-dimensional data matrix and the label column are constructed from the medical data set to be analyzed, specifically:
Suppose the original high-dimensional data matrix M is constructed, where n is the total number of samples in the medical data set and m is the original feature dimension. The first column of M is the label column, represented by a vector; the part of M after the first column is the data matrix, represented by the matrix X. The i-th row of the data matrix contains the observed values of the i-th sample under all features, and the j-th column contains all observed values of the j-th feature.
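Step 1 amounts to slicing the assembled matrix into a label column and a data matrix. A minimal sketch (the toy matrix and variable names are illustrative, not from the patent):

```python
import numpy as np

# toy stand-in for a medical data set: first column is the label (+1/-1),
# remaining m columns are feature observations; n samples in total
M = np.array([[ 1.0, 0.2, 1.5, 3.1],
              [-1.0, 0.8, 0.3, 2.2],
              [ 1.0, 0.1, 1.7, 2.9]])

labels = M[:, 0]        # label column
X = M[:, 1:]            # data matrix: row i = observations of sample i
n, m = X.shape
print(n, m)             # 3 3
```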
Further, in step 2, the optimization objective function is constructed, its Lagrangian function is solved, and the parameters and variables of the Lagrangian function are initialized, specifically:
In general, linear feature dimension reduction is modeled as the following optimization problem: for an original high-dimensional data set, the goal of feature dimension reduction is to find a transformation matrix A that maps the original high-dimensional data to new low-dimensional data, where y_i = A^T x_i.
Intuitively, if data of different classes are more dispersed and data of the same class are more aggregated, then a classifier can better distinguish the classes of the data. The dispersion between classes and the aggregation within classes are measured by the between-class scatter matrix and the within-class scatter matrix, respectively. The present invention starts from two aspects, global discriminant information and local discriminant information: on the one hand, it minimizes the trace of the within-class scatter matrix and maximizes the trace of the between-class scatter matrix; on the other hand, it uses Laplacian matrices to keep the local distribution of the converted low-dimensional data consistent with the distribution of the original high-dimensional data.
Step 2-1: (1) For the global discriminant information, the goal is to minimize the within-class distance while maximizing the between-class distance; the corresponding objective is:
min Tr(Q^T S_w Q) − α Tr(Q^T S_b Q)
(2) For the local discriminant information, the goal is to keep the local manifold structure of the converted low-dimensional data consistent with that of the original high-dimensional data; the primal objective function is:
min (1/2) Σ_{i,j} ‖y_i − y_j‖^2 ((W_w)_ij − (W_b)_ij)
Substituting y_i = Q^T x_i, the objective function reduces to:
min Tr(Q^T X (L_w − L_b) X^T Q)
Combining (1) and (2), and adding the regularization terms and the constraints, the optimization objective function is obtained:
min_{P,Q,E} Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1
s.t. X = PQ^T X + E, PP^T = I
In the formula, X is the original high-dimensional data matrix, where n is the total number of samples and m is the original feature dimension; the i-th row of the high-dimensional data matrix contains the observed values of the i-th sample under all features, and the j-th column contains all observed values of the j-th feature; Q and P are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced dimension; Tr(·) is the trace of a matrix; ‖·‖2,1 is the l2,1 norm, and ‖Q‖2,1 compensates the transformation error; ‖·‖1 is the l1 norm; E is the random error matrix and I is the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples and S_w is the within-class scatter matrix; L_b is the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w is the Laplacian matrix corresponding to the within-class adjacency graph G_w; the first part separated by the "+" sign extracts the global discriminant information, and the second part extracts the local discriminant information.
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:
L_ρ(P, Q, E, Y) = Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1 + ⟨Y, X − PQ^T X − E⟩ + (ρ/2)‖X − PQ^T X − E‖_F^2
In the formula, ρ is a penalty parameter with ρ > 0, and Y is the Lagrange multiplier.
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(·) returns the principal component coefficient matrix of its argument, so that P = PCA(X) initializes P as an orthogonal matrix.
Further, in step 3, the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for S_b and S_w are respectively:
S_b = Σ_{i=1}^{c} n_i (μ_i − μ)(μ_i − μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T
In the formulas, μ = (1/n) Σ_{j=1}^{n} x_j is the mean vector of all samples, where n is the total number of samples contained in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, with one class corresponding to each label; x_j is the j-th sample, i.e. the j-th row of the original high-dimensional data matrix; c_i is the i-th class; x_j ∈ c_i is true if and only if the label of the j-th sample is consistent with the label of the i-th class; n_i is the total number of samples of the i-th class, where samples with the same label belong to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class; x_k ∈ X_i is true if and only if the label of the k-th sample is consistent with the label of the i-th class; μ_i is the mean vector of the samples of the i-th class.
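The scatter matrices of step 3 can be computed directly from the data matrix and labels. A sketch in NumPy (toy data for illustration; rows of X are samples, as in step 1):

```python
import numpy as np

def scatter_matrices(X, labels):
    """Between-class scatter S_b and within-class scatter S_w."""
    mean_all = X.mean(axis=0)                        # mean vector of all samples
    Sb = np.zeros((X.shape[1], X.shape[1]))
    Sw = np.zeros_like(Sb)
    for c in np.unique(labels):
        Xc = X[labels == c]                          # samples of class c
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # n_i (mu_i - mu)(mu_i - mu)^T
        centered = Xc - mean_c
        Sw += centered.T @ centered                  # sum (x_k - mu_i)(x_k - mu_i)^T
    return Sb, Sw

X = np.array([[0., 0.], [0., 2.], [4., 0.], [4., 2.]])
y = np.array([1, 1, -1, -1])
Sb, Sw = scatter_matrices(X, y)
```

On this toy set the classes differ only along the first feature, so S_b concentrates on that coordinate while S_w captures only the within-class spread along the second.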
Further, in step 4, the Laplacian matrices L_b and L_w corresponding to the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for L_w and L_b are respectively:
L_w = D_w − W_w
L_b = D_b − W_b
In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of W, i.e. D_ii = Σ_j W_ij, where W_ij is the element in row i, column j of an adjacency graph weight matrix; W_w and W_b are the weight matrices of the within-class adjacency graph and of the between-class adjacency graph, respectively:
(W_w)_ij = 1 if x_i ∈ knn_w(x_j) or x_j ∈ knn_w(x_i), and 0 otherwise
(W_b)_ij = 1 if x_i ∈ knn_b(x_j) or x_j ∈ knn_b(x_i), and 0 otherwise
In the formulas, knn(x) denotes the set of k nearest neighbors of a sample point x, and k is a user-defined positive integer; knn(x) is further divided: knn_w(x) is the set of neighbors with the same label as x, and knn_b(x) is the set of neighbors with a label different from that of x.
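A sketch of the class-aware neighbor graphs and their Laplacians (assuming simple symmetric 0/1 weights; the patent's exact weighting scheme is not reproduced in the available text, so this is illustrative only):

```python
import numpy as np

def class_knn_laplacians(X, labels, k=1):
    """Build W_w (same-label kNN) and W_b (different-label kNN); return L = D - W for each."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    Ww = np.zeros((n, n)); Wb = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist[i])
        same = [j for j in order if j != i and labels[j] == labels[i]][:k]
        diff = [j for j in order if labels[j] != labels[i]][:k]
        for j in same:
            Ww[i, j] = Ww[j, i] = 1.0        # symmetric within-class adjacency
        for j in diff:
            Wb[i, j] = Wb[j, i] = 1.0        # symmetric between-class adjacency
    Lw = np.diag(Ww.sum(axis=1)) - Ww        # L = D - W
    Lb = np.diag(Wb.sum(axis=1)) - Wb
    return Lw, Lb

X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([1, 1, -1, -1])
Lw, Lb = class_knn_laplacians(X, y, k=1)
```

By construction each Laplacian has zero row sums, the defining property used in the trace identities above.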
Further, in step 5, combining the S_b, S_w, G_w and G_b obtained above, the Lagrangian function is iterated to solve for the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached, specifically:
To solve the Lagrangian function, let r = X − PQ^T X − E denote the residual; completing the square, the multiplier and penalty terms of the Lagrangian function L_ρ(P, Q, E, Y) can be simplified as:
⟨Y, r⟩ + (ρ/2)‖r‖_F^2 = (ρ/2)‖r + Y/ρ‖_F^2 − (1/(2ρ))‖Y‖_F^2
In the formula, let R = X − E + Y/ρ, so that r + Y/ρ = R − PQ^T X; then the final Lagrangian function simplifies to:
L_ρ = Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1 + (ρ/2)‖R − PQ^T X‖_F^2 + const.
(1) The iterative update formula for Q: setting the partial derivative ∂L_ρ/∂Q = 0 and solving yields the update formula for Q:
Q = (2(S_w − αS_b + βX(L_w − L_b)X^T) + λ1U + ρXX^T)^{-1} (ρXR^T P)
In the formula, U is a diagonal matrix whose i-th diagonal element is:
U_ii = 1/(2‖q_i‖_2)
In the formula, q_i is the i-th row of the matrix Q.
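Each Q-update is a single linear solve. The sketch below uses small random stand-ins for the matrices in the update formula (shapes, values, and the ordering X Rᵀ P on the right-hand side are illustrative assumptions chosen so that the dimensions are consistent, not the patent's verified notation):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 5, 8, 2
alpha, beta, lam1, rho = 0.01, 0.02, 0.01, 0.1

# toy stand-ins for the quantities in the update formula
X = rng.normal(size=(m, n))                 # data with samples as columns here
Sw = np.eye(m); Sb = 0.5 * np.eye(m)        # scatter matrices (toy)
Lw = np.eye(n); Lb = np.zeros((n, n))       # Laplacians (toy)
R = rng.normal(size=(m, n))                 # R = X - E + Y/rho
P = np.linalg.qr(rng.normal(size=(m, d)))[0]
Q = rng.normal(size=(m, d))

U = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + 1e-12))  # l2,1 reweighting
A = 2 * (Sw - alpha * Sb + beta * X @ (Lw - Lb) @ X.T) + lam1 * U + rho * X @ X.T
Q_new = np.linalg.solve(A, rho * X @ R.T @ P)   # closed-form Q update
print(Q_new.shape)                              # (5, 2)
```

Using `np.linalg.solve` instead of forming the explicit inverse is the usual numerically safer choice.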
(2) The iterative update formula for P: fixing the other variables and cancelling the constant terms, the subproblem for P is obtained:
min_P ‖R − PQ^T X‖_F^2 subject to the orthogonality constraint on P
Computing the SVD of RX^T Q as RX^T Q = USV^T, the update formula for P is then obtained:
P = UV^T
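The P-update is an orthogonal Procrustes step: the SVD of RXᵀQ gives the orthogonal factor directly. A sketch with toy matrices (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d = 5, 8, 2
R = rng.normal(size=(m, n))
X = rng.normal(size=(m, n))
Q = rng.normal(size=(m, d))

U, S, Vt = np.linalg.svd(R @ X.T @ Q, full_matrices=False)  # SVD of R X^T Q
P = U @ Vt                                                  # P = U V^T
# the columns of P are orthonormal, i.e. P^T P = I
```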
(3) The iterative update formula for E: let E_0 = X − PQ^T X + Y/ρ and e = λ2/ρ; the update formula for E is then obtained:
E = shrink(E_0, e)
In the formula, shrink denotes the soft-thresholding (shrinkage) operator, specifically:
shrink(E_0, e) = sign(E_0) ⊙ max(|E_0| − e, 0)
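The shrink operator is elementwise soft-thresholding, the proximal operator of the l1 norm; a sketch:

```python
import numpy as np

def shrink(E0, e):
    """Elementwise soft-thresholding: sign(E0) * max(|E0| - e, 0)."""
    return np.sign(E0) * np.maximum(np.abs(E0) - e, 0.0)

E0 = np.array([[0.5, -1.2], [0.05, 2.0]])
# entries shrink toward zero by e = 0.1: 0.4, -1.1, 0.0, 1.9
print(shrink(E0, 0.1))
```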
(4) The iterative update formulas for Y and ρ are:
Y = Y + ρ(X − PQ^T X − E)
ρ = min(ρ_max, μρ)
In the formulas, ρ_max and μ are predetermined constants.
Further, in step 6, a classifier is trained using the transformation matrix Q, and the matrix Q^T X is then evaluated by the AUC value of the classifier, specifically:
In the medical data set corresponding to the matrix Q^T X, the labels of positive samples are denoted by +1 and the labels of negative samples by −1.
The AUC value of the classifier is computed as:
AUC = (Σ over all positive-negative sample pairs of I(P_pos, P_neg)) / (num_pos · num_neg)
In the formula, num_pos and num_neg are the numbers of positive and negative samples respectively, and I(P_pos, P_neg) is:
I(P_pos, P_neg) = 1 if P_pos > P_neg; 0.5 if P_pos = P_neg; 0 if P_pos < P_neg
In the formula, P_pos is the probability with which the classifier predicts a sample to be a positive example, and P_neg is the probability with which the classifier predicts a sample to be a negative example.
The higher the obtained AUC value, the better the classification effect, i.e. the better the obtained transformation matrix Q.
Preferably, the classifier in step 6 is a KNN classifier.
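The pairwise AUC definition can be computed directly as stated; a sketch (toy scores for illustration):

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC = fraction of (positive, negative) pairs ranked correctly, ties counting 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.4, 0.3])
labels = np.array([1, -1, 1, -1])
print(auc_pairwise(scores, labels))   # 0.75
```

The double loop is O(num_pos * num_neg); for large data sets a rank-based formulation is the usual optimization, but the direct form matches the definition above.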
The present invention is described in further detail below with reference to an embodiment.
Embodiment
The medical data set feature dimension reduction method based on subspace learning of the present invention includes the following:
1. Construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed.
The data set used in this embodiment is the ARCENE data set, which is derived from human serum mass spectra. The ARCENE data set has 900 samples, and its feature dimension is as high as 10000. The task is a binary classification problem, intended to distinguish people with cancer (label +1) from normal people (label −1). The whole data set is merged from two prostate cancer data sets and one ovarian cancer data set, from the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). The data have no missing values, and about 44% of the samples are positive examples. The data set consists of three parts: a training data set with 100 samples, a validation data set with 100 samples, and a test data set with 700 samples. By default, d is set to 10.
2. Construct the optimization objective function, construct its Lagrangian function, and initialize the variables and parameters of the algorithm.
The objective function used in this embodiment is as follows:
min_{P,Q,E} Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1
s.t. X = PQ^T X + E, PP^T = I
The Lagrangian function of the above formula is:
L_ρ(P, Q, E, Y) = Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1 + ⟨Y, X − PQ^T X − E⟩ + (ρ/2)‖X − PQ^T X − E‖_F^2
In the formula, ρ (ρ > 0) is the penalty parameter and Y is the Lagrange multiplier.
The initial assignments of the variables and parameters are shown in Table 1 below:
Table 1. Variable and parameter initialization
α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; Q = 0, E = 0, Y = 0, P = PCA(X)
3. From the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w to obtain the global discriminant information.
The calculation formulas are respectively:
S_b = Σ_{i=1}^{c} n_i (μ_i − μ)(μ_i − μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T
In the formulas, μ = (1/n) Σ_{j=1}^{n} x_j is the mean vector of all samples, where n is the total number of samples contained in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, with one class corresponding to each label; x_j is the j-th sample, i.e. the j-th row of the original high-dimensional data matrix; c_i is the i-th class; x_j ∈ c_i is true if and only if the label of the j-th sample is consistent with the label of the i-th class; n_i is the total number of samples of the i-th class, where samples with the same label belong to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class; x_k ∈ X_i is true if and only if the label of the k-th sample is consistent with the label of the i-th class; μ_i is the mean vector of the samples of the i-th class.
The S_b and S_w obtained in this embodiment are respectively as follows:
4. From the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w to obtain the local discriminant information.
The L_b and L_w obtained in this embodiment are sparse matrices, respectively as follows:
5. Iteratively compute the transformation matrices Q and P, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the maximum number of iterations is reached:
(1) Q is obtained by the following formula:
Q = (2(S_w − αS_b + βX(L_w − L_b)X^T) + λ1U + ρXX^T)^{-1} (ρXR^T P)
The Q obtained in this embodiment is as follows:
(2) P is obtained by the following formula:
P = UV^T
The P obtained in this embodiment is as follows:
(3) E is obtained by the following formula:
E = shrink(E_0, e)
The E obtained in this embodiment is as follows:
(4) Y is obtained by the following formula:
Y = Y + ρ(X − PQ^T X − E)
The Y obtained in this embodiment is as follows:
6. Train a classifier using the transformation matrix Q, and then evaluate the matrix Q^T X by the AUC value of the classifier.
Convergence:
Fig. 2 shows the convergence curve obtained in this embodiment. The abscissa is the number of iterations; the maximum number of iterations in this embodiment is set to 100; the left ordinate is the classification accuracy (expressed as a percentage) and the right ordinate is the value of the objective function. The experimental results show that the model converges in no more than 30 iterations.
Parameter selection:
This embodiment has two preset parameters λ1 and λ2, whose settings are related to the convergence of the established optimization model. The parameter selection method is as follows:
This embodiment selects different pairwise combinations of λ1 and λ2 from the candidate set {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4, 10^5} to train the model, recording the AUC value under each combination; the results are shown in Fig. 3. The experimental results show that, for the ARCENE data set, the classification accuracy of the model is highest when λ1 and λ2 take 0.01 and 0.1 respectively. For different data sets the optimal λ1 and λ2 values are not exactly the same; experiments show that the parameters can be selected by the control variate method: fixing the value of one parameter, the value interval of the other parameter that gives the best performance is easy to obtain.
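The grid search described above can be organized as a simple double loop over the candidate set. The sketch below substitutes a dummy scoring function for the full training pipeline (the scoring function is an assumption for illustration only; it is rigged to peak at the values reported for ARCENE):

```python
import itertools
import math

candidates = [10.0 ** k for k in range(-5, 6)]   # {1e-5, ..., 1e5}

def evaluate(lam1, lam2):
    """Dummy stand-in for training the model and measuring its AUC."""
    # pretend the best score occurs at lam1 = 0.01, lam2 = 0.1
    return 1.0 - 0.05 * abs(math.log10(lam1) + 2) - 0.05 * abs(math.log10(lam2) + 1)

# exhaustive pairwise search, keeping the best-scoring combination
best = max(itertools.product(candidates, candidates),
           key=lambda pair: evaluate(*pair))
print(best)
```

The control variate method mentioned above corresponds to fixing one coordinate of the pair and sweeping only the other, which reduces the 11x11 grid to two 11-point sweeps.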
The present invention realizes feature dimension reduction of medical data sets: by computing the global discriminant information and the local discriminant information and iteratively solving for the transformation matrix according to the proposed optimization objective function, an optimal low-dimensional linear representation of the original high-dimensional data is obtained. Experiments show that, compared with existing feature dimension reduction methods for medical data sets, the method of the invention is not only applicable to feature dimension reduction at ordinary scales, but also retains high classification accuracy when the feature dimension of the data is much larger than the sample size. In addition, the parameters of the invention are adjustable, so that the trained model can adapt to specific tasks; the computation is very efficient, and the method is insensitive to outliers.
Claims (8)
1. A medical data set feature dimension reduction method based on subspace learning, characterized by comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, thereby obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, thereby obtaining the local discriminant information;
Step 5: combining the S_b, S_w, G_w and G_b obtained above, iterate on the Lagrangian function to solve for the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimension-reduced data matrix;
Step 6: train a classifier using the transformation matrix Q, and then evaluate the matrix Q^T X by the AUC value of the classifier.
2. The medical data set feature dimension reduction method based on subspace learning according to claim 1, characterized in that constructing the original high-dimensional data matrix and the label column from the medical data set to be analyzed in step 1 is specifically:
Suppose the original high-dimensional data matrix M is constructed, where n is the total number of samples in the medical data set and m is the original feature dimension; the first column of M is the label column, represented by a vector; the part of M after the first column is the data matrix, represented by the matrix X; the i-th row of the data matrix contains the observed values of the i-th sample under all features, and the j-th column contains all observed values of the j-th feature.
3. The medical data set feature dimension reduction method based on subspace learning according to claim 1 or 2, characterized in that constructing the optimization objective function, solving its Lagrangian function, and initializing the parameters and variables of the Lagrangian function in step 2 is specifically:
Step 2-1: construct the optimization objective function as follows:
min_{P,Q,E} Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1
s.t. X = PQ^T X + E, PP^T = I
In the formula, X is the original high-dimensional data matrix, where n is the total number of samples contained in the original high-dimensional data matrix and m is the original feature dimension; the i-th row of the high-dimensional data matrix contains the observed values of the i-th sample under all features, and the j-th column contains all observed values of the j-th feature; Q and P are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced dimension; Tr(·) is the trace of a matrix; ‖·‖2,1 is the l2,1 norm, and ‖Q‖2,1 compensates the transformation error; ‖·‖1 is the l1 norm; E is the random error matrix and I is the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples and S_w is the within-class scatter matrix; L_b is the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w is the Laplacian matrix corresponding to the within-class adjacency graph G_w; the first part separated by the "+" sign extracts the global discriminant information, and the second part extracts the local discriminant information;
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:
L_ρ(P, Q, E, Y) = Tr(Q^T S_w Q) − α Tr(Q^T S_b Q) + β Tr(Q^T X (L_w − L_b) X^T Q) + λ1‖Q‖2,1 + λ2‖E‖1 + ⟨Y, X − PQ^T X − E⟩ + (ρ/2)‖X − PQ^T X − E‖_F^2
In the formula, ρ is a penalty parameter with ρ > 0, and Y is the Lagrange multiplier;
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(·) returns the principal component coefficient matrix of its argument, so that P = PCA(X) initializes P as an orthogonal matrix.
4. The medical data collection feature dimension reduction method based on sub-space learning according to claim 3, which is characterized in that step 3, seeking from the original high dimensional data matrix and the label column the between-class scatter matrix Sb and the within-class scatter matrix Sw of the samples in the medical data collection, is specifically:
The formulas for seeking Sb and Sw are respectively:
Sb = Σ_{i=1..c} ni (mi − m)(mi − m)ᵀ
Sw = Σ_{i=1..c} Σ_{xk ∈ Xi} (xk − mi)(xk − mi)ᵀ
In the formulas, m = (1/n) Σ_{j=1..n} xj represents the mean vector of all samples, where n is the total number of samples in the original high dimensional data matrix; c is the number of classes of the original high dimensional data matrix, one label corresponding to one class; xj is the j-th sample, i.e. the j-th row of the original high dimensional data matrix; ci is the i-th class; xj ∈ ci is true if and only if the label of the j-th sample coincides with the label of the i-th class; ni is the total number of samples of the i-th class, where samples with the same label belong to the same class; xk is the k-th sample; Xi is the set of samples belonging to the i-th class; xk ∈ Xi is true if and only if the label of the k-th sample coincides with the label of the i-th class; mi is the mean vector of the samples of the i-th class.
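A minimal NumPy sketch of the scatter-matrix computation described in this claim (function and variable names are illustrative, not from the patent; rows of X are samples):

```python
import numpy as np

def scatter_matrices(X, labels):
    # Between-class (Sb) and within-class (Sw) scatter matrices for a data
    # matrix whose rows are samples.
    m = X.mean(axis=0)                        # mean vector of all samples
    Df = X.shape[1]
    Sb = np.zeros((Df, Df))
    Sw = np.zeros((Df, Df))
    for c in np.unique(labels):
        Xi = X[labels == c]                   # samples of class c
        mi = Xi.mean(axis=0)                  # class mean vector
        diff = (mi - m).reshape(-1, 1)
        Sb += Xi.shape[0] * diff @ diff.T     # weighted by class size
        Sw += (Xi - mi).T @ (Xi - mi)         # deviations within the class
    return Sb, Sw
```

Sb measures how far the class means spread around the global mean (global discriminant information); Sw measures the spread inside each class.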
5. The medical data collection feature dimension reduction method based on sub-space learning according to claim 4, which is characterized in that step 4, seeking from the original high dimensional data matrix and the label column the Laplacian matrices Lb and Lw corresponding to the between-class adjacency graph Gb and the within-class adjacency graph Gw of the samples, is specifically:
The formulas for seeking Lw and Lb are respectively:
Lw = Dw − Ww
Lb = Db − Wb
In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of the corresponding W, i.e. Dii = Σj Wij, where Wij is the element in the i-th row and j-th column of the adjacency-graph weight matrix; Ww and Wb are respectively the weight matrix of the within-class adjacency graph and the weight matrix of the between-class adjacency graph, given by:
(Ww)ij = 1 if xj ∈ knnw(xi) or xi ∈ knnw(xj), and 0 otherwise
(Wb)ij = 1 if xj ∈ knnb(xi) or xi ∈ knnb(xj), and 0 otherwise
In the formulas, knn(*) denotes the k-nearest-neighbour set of the sample point "*", with k a user-defined positive integer; knn(*) is further partitioned: knnw(*) is the set of neighbours with the same label as the sample point "*", and knnb(*) is the set of neighbours with a different label from the sample point "*".
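The within-class and between-class graph construction can be sketched as follows (a NumPy illustration; names are ours, and the 0/1 edge weights are one common choice):

```python
import numpy as np

def knn_graphs(X, labels, k=2):
    # Within-class (Ww) and between-class (Wb) kNN adjacency graphs and
    # their Laplacians; rows of X are samples.
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-neighbours
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:                  # k nearest neighbours of sample i
            if labels[i] == labels[j]:
                Ww[i, j] = Ww[j, i] = 1.0                # same label: within-class edge
            else:
                Wb[i, j] = Wb[j, i] = 1.0                # different label: between-class edge
    Lw = np.diag(Ww.sum(axis=1)) - Ww                    # L = D - W
    Lb = np.diag(Wb.sum(axis=1)) - Wb
    return Lw, Lb
```

Because L = D − W, every row of a graph Laplacian sums to zero, which is a convenient sanity check on the construction.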
6. The medical data collection feature dimension reduction method based on sub-space learning according to claim 5, which is characterized in that step 5, combining the Sb, Sw, Gw and Gb obtained above and iterating the Lagrangian to seek the transformation matrices Q and P used to reconstruct the original high dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the set maximum number of iterations is reached, is specifically:
(1) The formula for iteratively updating Q:
Let R = X − E + Y/ρ and set the partial derivative ∂Lρ/∂Q = 0; the formula for updating Q then follows as:
Q = (2(Sw − αSb + βX(Lw − Lb)Xᵀ) + λ1U + ρXXᵀ)⁻¹(ρXRᵀP)
In the formula, U is a diagonal matrix whose i-th diagonal element is:
Uii = 1 / (2‖qi‖2)
In the formula, qi is the i-th row of the matrix Q;
(2) The formula for iteratively updating P:
Let R = X − E + Y/ρ and drop the terms that are constant in P, which yields:
P = arg max_{PᵀP = I} Tr(PᵀRXᵀQ)
Compute the SVD of RXᵀQ as RXᵀQ = USVᵀ; the formula for updating P then follows as:
P = UVᵀ
(3) The formula for iteratively updating E:
Let E0 = X − PQᵀX + Y/ρ and e = λ2/ρ; the formula for updating E then follows as:
E = shrink(E0, e)
In the formula, shrink denotes the shrinkage (soft-thresholding) operator, applied element-wise:
shrink(E0, e) = sign(E0) max(|E0| − e, 0)
(4) The formulas for iteratively updating Y and ρ:
Y = Y + ρ(X − PQᵀX − E)
ρ = min(ρmax, μρ)
In the formulas, ρmax and μ are predefined constants.
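One round of these alternating updates can be sketched in NumPy. This is a sketch under stated assumptions: it takes R = X − E + Y/ρ, E0 = X − PQᵀX + Y/ρ and e = λ2/ρ as the standard augmented-Lagrangian auxiliaries (the patent's equation images for these are not reproduced in the text), and all names are illustrative:

```python
import numpy as np

def alm_step(X, P, Q, E, Y, Sw, Sb, Lw, Lb, alpha, beta, lam1, lam2, rho):
    # One pass of the alternating Q, P, E, Y updates. X is features x samples,
    # so P and Q have one row per feature and one column per reduced dimension.
    U = np.diag(1.0 / (2.0 * np.linalg.norm(Q, axis=1) + 1e-12))  # l2,1 subgradient term
    R = X - E + Y / rho
    A = 2.0 * (Sw - alpha * Sb + beta * X @ (Lw - Lb) @ X.T) + lam1 * U + rho * X @ X.T
    Q = np.linalg.solve(A, rho * X @ R.T @ P)                     # Q update
    Us, _, Vt = np.linalg.svd(R @ X.T @ Q, full_matrices=False)
    P = Us @ Vt                                                   # orthogonal Procrustes update
    E0 = X - P @ Q.T @ X + Y / rho
    E = np.sign(E0) * np.maximum(np.abs(E0) - lam2 / rho, 0.0)    # soft-thresholding
    Y = Y + rho * (X - P @ Q.T @ X - E)                           # multiplier update
    return P, Q, E, Y
```

Rebuilding P from an SVD each round keeps it exactly column-orthogonal, which is why the P subproblem reduces to an orthogonal Procrustes problem.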
7. The medical data collection feature dimension reduction method based on sub-space learning according to claim 1, which is characterized in that step 6, training a classifier with the transformation matrix Q and then evaluating the matrix QᵀX according to the AUC value of the classifier, is specifically:
In the medical data collection corresponding to the matrix QᵀX, the sample labels of positive examples are denoted by +1 and the sample labels of negative examples by −1;
Seek the AUC value of the classifier:
AUC = (1 / (numpos · numneg)) Σ_pos Σ_neg I(Ppos, Pneg)
In the formula, numpos and numneg are respectively the numbers of positive and negative samples, and I(Ppos, Pneg) is:
I(Ppos, Pneg) = 1 if Ppos > Pneg; 0.5 if Ppos = Pneg; 0 if Ppos < Pneg
In the formula, Ppos is the probability predicted by the classifier for a positive-example sample, and Pneg is the probability predicted by the classifier for a negative-example sample;
The higher the AUC value obtained, the better the classification effect, i.e. the better the transformation matrix Q obtained.
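The pairwise form of the AUC described in this claim can be sketched directly (an illustrative helper, not the patent's code):

```python
def pairwise_auc(scores_pos, scores_neg):
    # AUC as the fraction of (positive, negative) sample pairs the classifier
    # ranks correctly; ties count 0.5. Scores are predicted probabilities.
    total = 0.0
    for p in scores_pos:
        for q in scores_neg:
            total += 1.0 if p > q else (0.5 if p == q else 0.0)
    return total / (len(scores_pos) * len(scores_neg))

print(pairwise_auc([0.9, 0.8], [0.1, 0.8]))  # three correct pairs + one tie -> 0.875
```

This double loop is the literal definition; for large sample counts the same value is usually computed from ranks in O(n log n).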
8. The medical data collection feature dimension reduction method based on sub-space learning according to claim 7, which is characterized in that the classifier is specifically a KNN classifier.
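A plain kNN vote on the reduced features is enough for this evaluation step; a minimal sketch (columns of Z = QᵀX are samples; names are illustrative, and any off-the-shelf kNN implementation would do):

```python
import numpy as np

def knn_predict(train_Z, train_y, test_Z, k=3):
    # Majority-vote kNN classification on reduced features.
    preds = []
    for z in test_Z.T:
        d2 = ((train_Z.T - z) ** 2).sum(axis=1)   # distances to training samples
        nn = np.argsort(d2)[:k]                   # indices of the k nearest
        vals, counts = np.unique(np.asarray(train_y)[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])     # majority vote
    return np.array(preds)
```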
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910546805.9A CN110364264A (en) | 2019-06-24 | 2019-06-24 | Medical data collection feature dimension reduction method based on sub-space learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110364264A (en) | 2019-10-22
Family
ID=68216762
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910546805.9A Pending CN110364264A (en) | 2019-06-24 | 2019-06-24 | Medical data collection feature dimension reduction method based on sub-space learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110364264A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401471A (en) * | 2020-04-08 | 2020-07-10 | 中国人民解放军国防科技大学 | Spacecraft attitude anomaly detection method and system |
CN111401471B (en) * | 2020-04-08 | 2023-04-18 | 中国人民解放军国防科技大学 | Spacecraft attitude anomaly detection method and system |
CN112132624A (en) * | 2020-09-27 | 2020-12-25 | 平安医疗健康管理股份有限公司 | Medical claims data prediction system |
CN114677550A (en) * | 2022-02-25 | 2022-06-28 | 西北工业大学 | Rapid image pixel screening method based on sparse discriminant K-means |
CN114677550B (en) * | 2022-02-25 | 2024-02-27 | 西北工业大学 | Rapid image pixel screening method based on sparse discrimination K-means |
CN114897796A (en) * | 2022-04-22 | 2022-08-12 | 深圳市铱硙医疗科技有限公司 | Method, device, equipment and medium for judging stability of atherosclerotic plaque |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191022 |