CN110364264A - Medical data collection feature dimension reduction method based on sub-space learning - Google Patents

Medical data collection feature dimension reduction method based on sub-space learning

Info

Publication number
CN110364264A
Authority
CN
China
Prior art keywords
matrix
sample
class
formula
high dimensional
Prior art date
Legal status
Pending
Application number
CN201910546805.9A
Other languages
Chinese (zh)
Inventor
庾安妮
徐雷
Current Assignee
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date
Filing date
Publication date
Application filed by Nanjing Tech University
Priority to CN201910546805.9A
Publication of CN110364264A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients


Abstract

The invention discloses a feature dimensionality reduction method for medical data sets based on subspace learning. The method comprises the following steps: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed; construct the optimization objective function and solve its Lagrangian function; compute the global discriminant information and the local discriminant information from the original high-dimensional data matrix and the label column; iteratively solve for the transformation matrix Q until the objective function converges or the maximum number of iterations is reached, yielding the dimensionality-reduced data matrix; train a model with the obtained transformation matrix and evaluate the reduced matrix and the classification accuracy by the AUC value. Compared with existing feature dimensionality reduction methods for medical data sets, the method of the invention exploits the local and the global discriminant information of the data simultaneously, so it is applicable not only to feature reduction problems of ordinary scale but also retains high classification accuracy when the feature dimension of the data far exceeds the number of samples.

Description

Medical data collection feature dimension reduction method based on sub-space learning
Technical field
The invention belongs to the field of big data technology and machine learning, and in particular relates to a feature dimensionality reduction method for medical data sets based on subspace learning.
Background technique
Feature dimensionality reduction (Dimensionality Reduction) aims to convert high-dimensional data into low-dimensional data. The technique arose because machine learning problems in practical application scenarios produce large amounts of complex high-dimensional data. The running time of most data analysis tasks grows at least linearly with the data dimension, and storing and analyzing high-dimensional data consumes large amounts of computer memory and computation time. Moreover, many data mining and machine learning tasks, such as classification, clustering and regression, are only effective in low-dimensional spaces and become extremely difficult in high-dimensional spaces. How to reduce the dimensionality of high-dimensional data without losing important information is therefore a pressing problem.
Broadly, according to whether label information is available, feature dimensionality reduction methods can be roughly divided into supervised, semi-supervised and unsupervised categories. Subspace learning (Subspace Learning) is a class of linear feature dimensionality reduction methods, which assume that the "intrinsic dimensionality" (Intrinsic Dimensionality) of the data can be represented by a linear transformation of the feature vectors. Typical methods of this kind include principal component analysis (Principal Component Analysis, PCA), linear discriminant analysis (Linear Discriminant Analysis, LDA) and locality preserving projections (Locality Preserving Projection, LPP).
However, the existing subspace-learning methods all have drawbacks. PCA linearly maps the original high-dimensional data to low-dimensional data by maximizing the covariance of the data; LDA usually takes the form of a trace ratio, solving for the feature representation by simultaneously maximizing the between-class scatter matrix and minimizing the within-class scatter matrix. Both PCA and LDA perform eigenvalue decomposition by spectral methods and only consider discriminant information from a global viewpoint, such as the variance in PCA and the class means in LDA, while ignoring the discriminant information provided by the neighborhood of each sample point. When the sample size is much smaller than the feature dimension, LDA may encounter singular matrices, leading to inaccurate eigenvectors. LPP, conversely, constructs an adjacency graph of the sample points, computes weights, and preserves the linear structure of the neighborhoods, but it does not account for the importance of the global discriminant information, so its classification performance is limited. In addition, some existing feature dimensionality reduction methods, such as PCA and LDA, are parameter-free and assume that the sample points follow a Gaussian distribution, making them very sensitive to outliers.
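As a point of reference for the spectral methods discussed above, the following is a minimal PCA sketch in NumPy (illustrative only, not the invention's method; the data are synthetic): it projects data onto the top eigenvectors of the covariance matrix, i.e. a purely global dimensionality reduction of the kind the text contrasts with LPP.

```python
import numpy as np

# Minimal PCA sketch: project data onto the leading eigenvectors of the
# sample covariance matrix (a global, spectral subspace-learning step).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))                 # 50 samples, 8 features
Xc = X - X.mean(axis=0)                      # center the data
C = Xc.T @ Xc / (X.shape[0] - 1)             # 8 x 8 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
A = eigvecs[:, ::-1][:, :3]                  # top-3 principal directions
Y = Xc @ A                                   # 50 x 3 low-dimensional data

assert Y.shape == (50, 3)
```

Because the projection directions are eigenvectors of the covariance matrix, the projected coordinates are uncorrelated; this is the "global variance only" behavior that the proposed method supplements with local neighborhood information.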
At present, the problem of reducing the dimensionality of high-dimensional data is modeled as an optimization problem whose solution often involves eigenvalue decomposition, but it has been pointed out in the literature that the optimal solutions of certain problems cannot be obtained by eigenvalue decomposition. The alternating direction method of multipliers (Alternating Direction Method of Multipliers, ADMM) is suitable for solving convex optimization problems, is computationally efficient and converges quickly, and is a current research focus.
Summary of the invention
The purpose of the present invention is to provide a feature dimensionality reduction method that is simple, efficient, fast to converge and highly accurate in classification.
The technical solution realizing the aim of the invention is as follows: a feature dimensionality reduction method for medical data sets based on subspace learning, comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Compared with the prior art, the remarkable advantages of the present invention are: 1) global and local discriminant information are jointly used for dimensionality reduction, so high classification accuracy is still achieved when the feature dimension far exceeds the number of samples; 2) the l2,1-norm regularization term makes the features sparse, which facilitates feature selection, and the trained model is robust and not easily disturbed by outliers; 3) unlike many parameter-free methods, the proposed method has adjustable parameters, so the trained model can adapt to specific tasks, and experiments show that suitable parameters can be selected by a simple procedure; 4) the proposed solving method for the optimization problem is very efficient, and experiments show that the model converges quickly; 5) the classification accuracy is high.
The present invention is described in further detail below with reference to the accompanying drawings.
Detailed description of the invention
Fig. 1 is the flow chart of the medical data set feature dimensionality reduction method of the present invention based on subspace learning.
Fig. 2 is the convergence curve graph in the embodiment of the present invention.
Fig. 3 is the parameter selection figure in the embodiment of the present invention.
Specific embodiment
With reference to Fig. 1, the medical data set feature dimensionality reduction method of the invention based on subspace learning comprises the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Further, in step 1, the original high-dimensional data matrix and the label column are constructed from the medical data set to be analyzed, specifically:
Construct the original matrix M from the data set, where n is the total number of samples in the medical data set and m is the original feature dimension. The first column of M is the label column, denoted by a vector; the remainder of M after the first column is the data matrix, denoted by the matrix X. The i-th row of the data matrix gives the observed values of the i-th sample under all features, and the j-th column gives all observed values of the j-th feature.
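The construction in step 1 can be sketched as follows; this is a minimal NumPy illustration on synthetic data, and the variable names are assumptions, not the patent's.

```python
import numpy as np

# Hypothetical data set M: first column is the label column (+1 / -1),
# remaining columns are the m feature observations of n samples.
rng = np.random.default_rng(0)
n, m = 6, 4                                  # n samples, m features
labels = rng.choice([-1, 1], size=(n, 1))    # label column
features = rng.normal(size=(n, m))           # feature observations
M = np.hstack([labels, features])            # n x (m + 1) matrix M

y = M[:, 0]        # label column vector
X = M[:, 1:]       # data matrix: row i = sample i, column j = feature j

assert X.shape == (n, m) and y.shape == (n,)
```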
Further, in step 2, the optimization objective function is constructed, its Lagrangian function is derived, and the parameters and variables of the Lagrangian function are initialized, specifically:
In general, linear feature dimensionality reduction is modeled as the following optimization problem: for an original high-dimensional data set, the goal of feature dimensionality reduction is to find a transformation matrix A that maps the original high-dimensional data to new low-dimensional data, where y_i = A^T x_i.
Intuitively, if data of different classes are more dispersed and data of the same class are more aggregated, a classifier can better distinguish the classes of the data. The between-class dispersion and the within-class aggregation are measured by the between-class scatter matrix and the within-class scatter matrix, respectively. The present invention starts from both global and local discriminant information: on the one hand it minimizes the trace of the within-class scatter matrix and maximizes the trace of the between-class scatter matrix; on the other hand it uses Laplacian matrices to keep the local distribution of the converted low-dimensional data consistent with that of the original high-dimensional data.
Step 2-1: (1) For the global discriminant information, the target is to minimize the within-class distance while maximizing the between-class distance; the corresponding objective term is:

min_Q tr(Q^T S_w Q) - α tr(Q^T S_b Q)

(2) For the local discriminant information, the target is to keep the local manifold structure of the converted low-dimensional data consistent with that of the original high-dimensional data; the primal objective constructed is:

min Σ_{i,j} ‖y_i - y_j‖^2 (W_w - W_b)_{ij}

Substituting y_i = Q^T x_i, the objective reduces to:

min_Q tr(Q^T X (L_w - L_b) X^T Q)

Combining (1) and (2) and adding the regularization terms and constraints, the optimization objective function is obtained:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

In the formula, X is the original high-dimensional data matrix, where n is the total number of samples and m is the original feature dimension; the i-th row of the high-dimensional data matrix gives the observed values of the i-th sample under all features, and the j-th column all observed values of the j-th feature. P and Q are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced low dimension; tr(*) is the trace of *; ‖*‖_{2,1} is the l2,1-norm of *, with ‖Q‖_{2,1} compensating the transformation error; ‖*‖_1 is the l1-norm of *; E is the random error matrix and I is the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples, S_w the within-class scatter matrix, L_b the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w the Laplacian matrix corresponding to the within-class adjacency graph G_w. The first part separated by "+" provides the global discriminant information and the second part the local discriminant information;
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ is the penalty parameter, ρ > 0, and Y is the Lagrange multiplier;
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(*) returns the principal component coefficients of the matrix * and P = PCA(X) initializes an orthogonal matrix P.
Further, in step 3, the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for S_b and S_w are respectively:

S_b = Σ_{i=1}^{c} n_i (μ_i - μ)(μ_i - μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k - μ_i)(x_k - μ_i)^T

In the formulas, μ = (1/n) Σ_{j=1}^{n} x_j is the mean vector of all samples, where n is the total number of samples in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, one class per label; x_j is the j-th sample, i.e. the j-th row of data of the original high-dimensional data matrix; c_i is the i-th class, and x_j ∈ c_i is true if and only if the label of the j-th sample matches the label of the i-th class; n_i is the total number of samples of the i-th class, samples with identical labels belonging to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class, and x_k ∈ X_i is true if and only if the label of the k-th sample matches the label of the i-th class; μ_i = (1/n_i) Σ_{x_k ∈ X_i} x_k is the mean vector of the i-th class.
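The two scatter matrices above follow the standard LDA definitions and can be sketched as follows; this is an illustration on hypothetical two-class data, not the patent's code.

```python
import numpy as np

# Between-class scatter S_b and within-class scatter S_w on a synthetic
# two-class sample (samples as rows of X, labels in y).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)),
               rng.normal(3.0, 1.0, (12, 5))])   # 22 samples, 5 features
y = np.array([0] * 10 + [1] * 12)

mu = X.mean(axis=0)                              # mean vector of all samples
Sb = np.zeros((5, 5))
Sw = np.zeros((5, 5))
for c in np.unique(y):
    Xc = X[y == c]                               # samples of class c
    mu_c = Xc.mean(axis=0)                       # class mean vector
    d = (mu_c - mu)[:, None]
    Sb += len(Xc) * (d @ d.T)                    # n_c (mu_c - mu)(mu_c - mu)^T
    Sw += (Xc - mu_c).T @ (Xc - mu_c)            # within-class outer products

# Sanity check: S_b + S_w equals the total scatter S_t.
St = (X - mu).T @ (X - mu)
assert np.allclose(Sb + Sw, St)
```

The final assertion verifies the classical decomposition S_t = S_b + S_w, a quick way to validate an implementation of these formulas.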
Further, in step 4, the Laplacian matrices L_b and L_w corresponding to the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for L_w and L_b are respectively:

L_w = D_w - W_w
L_b = D_b - W_b

In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of W, i.e. D_ii = Σ_j W_ij, where W_ij is the element in row i, column j of the adjacency-graph weight matrix; W_w and W_b are the weight matrices of the within-class and between-class adjacency graphs respectively:

W_w,ij = 1 if x_j ∈ knn_w(x_i) or x_i ∈ knn_w(x_j), and 0 otherwise
W_b,ij = 1 if x_j ∈ knn_b(x_i) or x_i ∈ knn_b(x_j), and 0 otherwise

In the formulas, knn(*) denotes the set of k nearest neighbors of the sample point *, where k is a user-defined positive integer; knn(*) is further split into knn_w(*), the set of neighbors with the same label as the sample point *, and knn_b(*), the set of neighbors with a different label from *.
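The label-aware neighbor graphs described above can be sketched as follows; a minimal NumPy illustration on synthetic data (the function name and the symmetrization convention are assumptions).

```python
import numpy as np

# W_w connects each sample to its k nearest neighbors with the SAME label,
# W_b to its k nearest neighbors with a DIFFERENT label; L = D - W for both.
def knn_graphs(X, y, k=2):
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(dist[i, same])][:k]:
            Ww[i, j] = Ww[j, i] = 1.0            # symmetrize the graph
        for j in diff[np.argsort(dist[i, diff])][:k]:
            Wb[i, j] = Wb[j, i] = 1.0
    Lw = np.diag(Ww.sum(axis=1)) - Ww            # L_w = D_w - W_w
    Lb = np.diag(Wb.sum(axis=1)) - Wb            # L_b = D_b - W_b
    return Lw, Lb

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 4))
y = np.array([0] * 6 + [1] * 6)
Lw, Lb = knn_graphs(X, y, k=2)
# A graph Laplacian is symmetric with zero row sums.
assert np.allclose(Lw, Lw.T) and np.allclose(Lw.sum(axis=1), 0)
```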
Further, in step 5, using the S_b, S_w, G_w and G_b obtained above, the Lagrangian function is iterated to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached, specifically:
To solve the Lagrangian function, let r = X - P Q^T X - E be the residual; completing the square, the Lagrangian function L_ρ(P, Q, E, Y) can be simplified to:

L_ρ = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + (ρ/2) ‖X - P Q^T X - E + Y/ρ‖_F^2 - (1/(2ρ)) ‖Y‖_F^2

In the formula, let R = X - E + Y/ρ; the final simplified Lagrangian is then:

L_ρ = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + (ρ/2) ‖R - P Q^T X‖_F^2 - (1/(2ρ)) ‖Y‖_F^2
(1) The iterative update formula for Q is derived as follows:
Fixing P and E and setting the partial derivative ∂L_ρ/∂Q = 0 yields the update formula for Q:

Q = (2(S_w - α S_b + β X (L_w - L_b) X^T) + λ1 U + ρ X X^T)^{-1} (ρ X R^T P)

In the formula, U is a diagonal matrix whose i-th diagonal element is:

U_ii = 1 / (2 ‖q_i‖_2)

where q_i is the i-th row of data of the matrix Q;
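The diagonal reweighting matrix U comes from the standard iterative treatment of the l2,1-norm term. A minimal sketch (the epsilon guard against zero rows is a common practical safeguard, not stated in the source):

```python
import numpy as np

# U_ii = 1 / (2 * ||q_i||_2), with q_i the i-th row of Q.
def l21_weight_matrix(Q, eps=1e-12):
    row_norms = np.linalg.norm(Q, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

Q = np.array([[3.0, 4.0],      # row norm 5  -> U_00 = 0.1
              [0.0, 2.0]])     # row norm 2  -> U_11 = 0.25
U = l21_weight_matrix(Q)
assert np.allclose(np.diag(U), [0.1, 0.25])
# With this U, tr(Q^T U Q) = 0.5 * ||Q||_{2,1}: the identity behind the update.
assert np.isclose(np.trace(Q.T @ U @ Q), 0.5 * np.linalg.norm(Q, axis=1).sum())
```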
(2) The iterative update formula for P is derived as follows:
Fixing Q and E and cancelling the terms constant in P, P is obtained from:

min_P ‖R - P Q^T X‖_F^2  s.t. P P^T = I

Computing the SVD of R X^T Q as R X^T Q = U S V^T, the update formula for P is:

P = U V^T
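This P-update is an orthogonal Procrustes problem, solved in closed form by an SVD. A minimal sketch on random matrices (the shapes are hypothetical):

```python
import numpy as np

# P-update: with Q, X, E, Y fixed, minimizing ||R - P Q^T X||_F^2 over
# orthogonal P is solved by the SVD  R X^T Q = U S V^T,  P = U V^T.
rng = np.random.default_rng(4)
m, n, d = 6, 20, 3
R = rng.normal(size=(m, n))                  # stands in for R = X - E + Y/rho
XtQ = rng.normal(size=(n, d))                # stands in for X^T Q
U, s, Vt = np.linalg.svd(R @ XtQ, full_matrices=False)
P = U @ Vt                                   # m x d, columns orthonormal

assert P.shape == (m, d)
assert np.allclose(P.T @ P, np.eye(d), atol=1e-10)
```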
(3) The iterative update formula for E is derived as follows:
Fixing P and Q, let E_0 = X - P Q^T X + Y/ρ and e = λ2/ρ; the update formula for E is then:

E = shrink(E_0, e)

In the formula, shrink denotes the soft-thresholding (shrinkage) operator, specifically:

shrink(E_0, e) = sign(E_0) max(|E_0| - e, 0)
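The shrinkage operator above is element-wise soft-thresholding and admits a one-line NumPy sketch (illustrative values):

```python
import numpy as np

# shrink(E0, e) = sign(E0) * max(|E0| - e, 0), applied element-wise.
def shrink(E0, e):
    return np.sign(E0) * np.maximum(np.abs(E0) - e, 0.0)

E0 = np.array([[ 1.5, -0.2],
               [-3.0,  0.4]])
E = shrink(E0, 0.5)
# Entries with |value| <= 0.5 are zeroed; the rest move 0.5 towards zero.
assert np.allclose(E, [[1.0, 0.0], [-2.5, 0.0]])
```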
(4) The iterative update formulas for Y and ρ are:

Y = Y + ρ (X - P Q^T X - E)
ρ = min(ρ_max, μρ)

In the formulas, ρ_max and μ are predefined constants.
Further, in step 6, a classifier is trained with the transformation matrix Q, and the matrix Q^T X is then evaluated by the AUC value of the classifier, specifically:
In the medical data set corresponding to the matrix Q^T X, the sample label of a positive example is denoted by +1 and the sample label of a negative example by -1;
The AUC value of the classifier is computed as:

AUC = (1 / (num_pos · num_neg)) Σ_{pos} Σ_{neg} I(P_pos, P_neg)

In the formula, num_pos and num_neg are the numbers of positive and negative samples respectively, and I(P_pos, P_neg) is:

I(P_pos, P_neg) = 1 if P_pos > P_neg, 0.5 if P_pos = P_neg, and 0 if P_pos < P_neg

In the formula, P_pos is the probability with which the classifier predicts a sample to be a positive example, and P_neg the probability with which it predicts a sample to be a negative example;
The higher the computed AUC value, the better the classification effect, i.e. the better the obtained transformation matrix Q.
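The pairwise AUC above can be sketched directly from its definition; a minimal NumPy illustration on hypothetical scores (the function name is an assumption).

```python
import numpy as np

# Pairwise AUC: compare every (positive, negative) score pair, counting 1
# for a correctly ordered pair, 0.5 for a tie, 0 otherwise, then average.
def pairwise_auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.4])
labels = np.array([ 1,   1,  -1,  -1,   1])
# Pairs: (0.9,0.4),(0.9,0.3),(0.8,0.4),(0.8,0.3),(0.4,0.3) ordered correctly;
# (0.4,0.4) ties  ->  (5 + 0.5) / 6
assert np.isclose(pairwise_auc(scores, labels), 5.5 / 6)
```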
Preferably, the classifier in step 6 is a KNN classifier.
The present invention is described in further detail below with reference to an embodiment.
Embodiment
The medical data set feature dimensionality reduction method of the invention based on subspace learning includes the following contents:
1. Construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed.
The data set used in this embodiment is the ARCENE data set, derived from human serum mass spectra. The ARCENE data set has 900 samples, and its feature dimension is as high as 10000. The task is a binary classification problem aiming to distinguish people with cancer (label +1) from normal people (label -1). The whole data set was merged from two prostate cancer data sets and one ovarian cancer data set, collected by the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). The data have no missing values, and about 44% of the samples are positive examples. The data set consists of three parts: a training set of 100 samples, a validation set of 100 samples and a test set of 700 samples. The reduced dimension d is set to 10 by default.
2. Construct the optimization objective function and its Lagrangian function, and initialize the variables and parameters of the algorithm.
The objective function used in this embodiment is as follows:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

The Lagrangian function of the above formula is:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ (ρ > 0) is the penalty parameter and Y is the Lagrange multiplier.
The initial assignments of the variables and parameters are shown in Table 1 below:
Table 1. Variable and parameter initialization
3. From the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w, obtaining the global discriminant information.
The formulas are respectively:

S_b = Σ_{i=1}^{c} n_i (μ_i - μ)(μ_i - μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k - μ_i)(x_k - μ_i)^T

In the formulas, μ is the mean vector of all samples, where n is the total number of samples in the original high-dimensional data matrix; c is the number of classes, one class per label; x_j is the j-th sample, i.e. the j-th row of data of the original high-dimensional data matrix; c_i is the i-th class, and x_j ∈ c_i is true if and only if the label of the j-th sample matches the label of the i-th class; n_i is the total number of samples of the i-th class, samples with identical labels belonging to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class, and x_k ∈ X_i is true if and only if the label of the k-th sample matches the label of the i-th class; μ_i is the mean vector of the i-th class.
The S_b and S_w obtained in this embodiment are respectively:
4. From the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information.
The L_b and L_w obtained in this embodiment are sparse matrices, respectively:
5. Iteratively compute the transformation matrices Q and P, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the maximum number of iterations is reached.
(1) Q is obtained from the following formula:

Q = (2(S_w - α S_b + β X (L_w - L_b) X^T) + λ1 U + ρ X X^T)^{-1} (ρ X R^T P)

The Q obtained in this embodiment is:
(2) P is obtained from the following formula:

P = U V^T

The P obtained in this embodiment is:
(3) E is obtained from the following formula:

E = shrink(E_0, e)

The E obtained in this embodiment is:
(4) Y is obtained from the following formula:

Y = Y + ρ (X - P Q^T X - E)

The Y obtained in this embodiment is:
6. Train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Convergence:
As shown in Fig. 2, the convergence curve obtained in this embodiment has the number of iterations on the abscissa (the maximum number of iterations is set to 100), the classification accuracy (expressed as a percentage) on the left ordinate, and the value of the objective function on the right ordinate. The experimental results of the embodiment show that the model converges in no more than 30 iterations.
Parameter selection:
This embodiment has two preset parameters λ1 and λ2, whose settings affect the convergence of the established optimization model. The parameter selection method is as follows:
This embodiment selects different pairwise combinations of λ1 and λ2 from the candidate set {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4, 10^5} to train the model, recording the AUC value under each combination; the results are shown in Fig. 3. The experimental results of the embodiment show that, for the ARCENE data set, the classification accuracy of the model is highest when λ1 and λ2 take 0.01 and 0.1 respectively. The optimal values of λ1 and λ2 differ across data sets, but experiments show that the parameters can be selected by the control-variable method: fixing the value of one parameter, the interval of values of the other parameter for which the algorithm performs best is easy to obtain.
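The parameter sweep described above can be sketched as a simple grid search; `eval_model` below is a hypothetical stand-in for "train the model with (λ1, λ2) and return its AUC", here replaced by a toy surrogate for illustration only.

```python
import numpy as np

# Sweep lambda1 and lambda2 over the log-spaced candidate set, score each
# pair, and keep the best-scoring combination.
candidates = [10.0 ** p for p in range(-5, 6)]   # 1e-5 ... 1e5

def eval_model(lam1, lam2):
    # Toy surrogate score peaking at lam1 = 1e-2, lam2 = 1e-1 (illustration
    # only; a real run would train the model and return the validation AUC).
    return -((np.log10(lam1) + 2) ** 2 + (np.log10(lam2) + 1) ** 2)

best = max(((l1, l2) for l1 in candidates for l2 in candidates),
           key=lambda pair: eval_model(*pair))
assert best == (candidates[3], candidates[4])    # lambda1 = 1e-2, lambda2 = 1e-1
```

The control-variable refinement mentioned in the text corresponds to fixing one coordinate of `best` and re-sweeping only the other, which reduces the cost from quadratic to linear in the candidate-set size.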
The present invention realizes feature dimensionality reduction of medical data sets: by computing the global discriminant information and the local discriminant information and iteratively solving for the transformation matrix according to the proposed optimization objective function, an optimal low-dimensional linear representation of the original high-dimensional data is reached. Experiments show that, compared with existing feature dimensionality reduction methods for medical data sets, the method of the invention is not only applicable to feature reduction problems of ordinary scale but also retains high classification accuracy when the feature dimension of the data far exceeds the number of samples. In addition, the parameters of the invention are adjustable, so the trained model can adapt to specific tasks; the computation method is very efficient; and it is insensitive to outliers.

Claims (8)

1. A feature dimensionality reduction method for medical data sets based on subspace learning, characterized by comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
2. The medical data set feature dimensionality reduction method based on subspace learning according to claim 1, characterized in that in step 1 the original high-dimensional data matrix and the label column are constructed from the medical data set to be analyzed, specifically:
Construct the original matrix M, where n is the total number of samples in the medical data set and m is the original feature dimension; the first column of M is the label column, denoted by a vector; the remainder of M after the first column is the data matrix, denoted by a matrix; the i-th row of the data matrix gives the observed values of the i-th sample under all features, and the j-th column gives all observed values of the j-th feature.
3. The medical data set feature dimensionality reduction method based on subspace learning according to claim 1 or 2, characterized in that in step 2 the optimization objective function is constructed, its Lagrangian function is derived, and the parameters and variables of the Lagrangian function are initialized, specifically:
Step 2-1: construct the optimization objective function as follows:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

In the formula, X is the original high-dimensional data matrix, where n is the total number of samples it contains and m is the original feature dimension; the i-th row of the high-dimensional data matrix gives the observed values of the i-th sample under all features, and the j-th column all observed values of the j-th feature; P and Q are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced low dimension; tr(*) is the trace of *; ‖*‖_{2,1} is the l2,1-norm of *, with ‖Q‖_{2,1} compensating the transformation error; ‖*‖_1 is the l1-norm of *; E is the random error matrix and I the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples, S_w the within-class scatter matrix, L_b the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w the Laplacian matrix corresponding to the within-class adjacency graph G_w; the first part separated by "+" provides the global discriminant information and the second part the local discriminant information;
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ is the penalty parameter, ρ > 0, and Y is the Lagrange multiplier;
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(*) returns the principal component coefficients of the matrix * and P = PCA(X) initializes an orthogonal matrix P.
4. The medical dataset feature dimension reduction method based on subspace learning according to claim 3, characterized in that step 3 computes, from the original high-dimensional data matrix and the label column, the between-class scatter matrix Sb and the within-class scatter matrix Sw of the samples in the medical dataset, specifically:
The formulas for Sb and Sw are respectively:

Sb = Σ_{i=1}^{c} ni (μi − μ)(μi − μ)^T

Sw = Σ_{i=1}^{c} Σ_{xk ∈ Xi} (xk − μi)(xk − μi)^T

In the formulas, μ = (1/n) Σ_{j=1}^{n} xj is the mean vector of all samples, where n is the number of samples in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, one class per distinct label; xj is the j-th sample, i.e. the j-th row of the original high-dimensional data matrix; ci is the i-th class, and xj ∈ ci is true if and only if the label of the j-th sample equals the label of the i-th class; ni is the number of samples of the i-th class, samples with the same label belonging to the same class; xk is the k-th sample; Xi is the set of samples belonging to the i-th class, and xk ∈ Xi is true if and only if the label of the k-th sample equals the label of the i-th class; μi = (1/ni) Σ_{xk ∈ Xi} xk is the mean vector of the samples of the i-th class.
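The two scatter matrices can be computed directly from the data and labels; a minimal NumPy sketch, again taking one sample per column (`scatter_matrices` is an illustrative name):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (Sb) and within-class (Sw) scatter matrices.
    X: m x n, one sample per column; y: length-n label array."""
    mu = X.mean(axis=1, keepdims=True)            # mean vector of all samples
    Sb = np.zeros((X.shape[0], X.shape[0]))
    Sw = np.zeros_like(Sb)
    for c in np.unique(y):
        Xi = X[:, y == c]                         # samples of class c
        ni = Xi.shape[1]
        mui = Xi.mean(axis=1, keepdims=True)      # class mean vector
        Sb += ni * (mui - mu) @ (mui - mu).T
        Sw += (Xi - mui) @ (Xi - mui).T
    return Sb, Sw
```

A useful sanity check is the classical identity Sb + Sw = St, where St is the total scatter matrix of the centered data.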
5. The medical dataset feature dimension reduction method based on subspace learning according to claim 4, characterized in that step 4 computes, from the original high-dimensional data matrix and the label column, the Laplacian matrices Lb and Lw corresponding to the between-class adjacency graph Gb and the within-class adjacency graph Gw of the samples, specifically:
The formulas for Lw and Lb are respectively:

Lw = Dw − Ww

Lb = Db − Wb

In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of the corresponding W, i.e. Dii = Σj Wij, where Wij is the element in row i, column j of the adjacency-graph weight matrix; Ww and Wb are the weight matrices of the within-class and between-class adjacency graphs, respectively:

(Ww)ij = 1 if xi ∈ knn_w(xj) or xj ∈ knn_w(xi), and 0 otherwise

(Wb)ij = 1 if xi ∈ knn_b(xj) or xj ∈ knn_b(xi), and 0 otherwise

In the formulas, knn(·) is the k-nearest-neighbour set of a sample point, with k a user-defined positive integer; knn(·) is further segmented into knn_w(·), the set of neighbours sharing the sample point's label, and knn_b(·), the set of neighbours with a different label.
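The graph construction above can be sketched as follows. The claim segments knn(·) into same-label and different-label neighbour sets; this sketch approximates that by taking the overall k nearest neighbours of each point and splitting them by label (function name and the symmetrization choice are assumptions):

```python
import numpy as np

def class_aware_laplacians(X, y, k):
    """Within-class (Lw) and between-class (Lb) graph Laplacians.
    X: m x n, one sample per column; y: length-n labels; k: neighbour count."""
    n = X.shape[1]
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    Ww = np.zeros((n, n)); Wb = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D2[i])
        knn = [j for j in order if j != i][:k]       # k nearest neighbours of sample i
        for j in knn:
            if y[i] == y[j]:
                Ww[i, j] = Ww[j, i] = 1.0            # same-label neighbour -> within-class edge
            else:
                Wb[i, j] = Wb[j, i] = 1.0            # different-label neighbour -> between-class edge
    Lw = np.diag(Ww.sum(axis=1)) - Ww                # L = D - W
    Lb = np.diag(Wb.sum(axis=1)) - Wb
    return Lw, Lb
```

By construction each Laplacian is symmetric with zero row sums, which is a quick way to validate the output.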
6. The medical dataset feature dimension reduction method based on subspace learning according to claim 5, characterized in that step 5, combining the Sb, Sw, Gw and Gb obtained above, iterates the Lagrangian to obtain the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached, specifically:
(1) The iterative update of Q:

Let R = X − E + Y/ρ and set the partial derivative ∂L_ρ/∂Q = 0; this yields the update formula for Q:

Q = (2(Sw − αSb + βX(Lw − Lb)X^T) + λ1U + ρXX^T)^{-1}(ρXR^T P)

In the formula, U is a diagonal matrix whose i-th diagonal element is

Uii = 1/(2‖qi‖2)

where qi is the i-th row of the matrix Q.
(2) The iterative update of P:

Let R = X − E + Y/ρ and drop the terms that are constant in P; the subproblem reduces to an orthogonal Procrustes problem. Compute the SVD of RX^T Q as RX^T Q = USV^T; the update formula for P is then:

P = UV^T
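Update (2) is a standard orthogonal Procrustes step; a minimal sketch under the same sample-per-column convention:

```python
import numpy as np

def update_P(X, Q, E, Y, rho):
    """Orthogonal P update via SVD (step 5, update (2)): P = U V^T
    where U S V^T is the SVD of R X^T Q."""
    R = X - E + Y / rho
    U, _, Vt = np.linalg.svd(R @ X.T @ Q, full_matrices=False)
    return U @ Vt                                  # orthonormal columns by construction
```

Because U and V come from an SVD, the updated P always has orthonormal columns, so the orthogonality constraint is preserved at every iteration.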
(3) The iterative update of E:

Let E0 = X − PQ^T X + Y/ρ and e = λ2/ρ; the update formula for E is then:

E = shrink(E0, e)

In the formula, shrink is the shrinkage (soft-thresholding) operator, applied element-wise:

shrink(E0, e) = sign(E0) max(|E0| − e, 0)
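The shrinkage operator is a one-liner in NumPy; the sign and maximum are taken element-wise:

```python
import numpy as np

def shrink(E0, e):
    """Element-wise soft-thresholding: sign(E0) * max(|E0| - e, 0)."""
    return np.sign(E0) * np.maximum(np.abs(E0) - e, 0.0)
```

Entries with magnitude below the threshold e are set exactly to zero, which is what makes E sparse under the l1 penalty.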
(4) The iterative updates of Y and ρ:

Y = Y + ρ(X − PQ^T X − E)

ρ = min(ρmax, μρ)

In the formulas, ρmax and μ are predefined constants.
7. The medical dataset feature dimension reduction method based on subspace learning according to claim 1, characterized in that step 6 trains a classifier with the transformation matrix Q and then evaluates the matrix Q^T X by the AUC value of the classifier, specifically:

In the medical dataset corresponding to the matrix Q^T X, the labels of positive samples are denoted by +1 and the labels of negative samples are denoted by −1.
Compute the AUC value of the classifier:

AUC = (1/(numpos · numneg)) Σ_{pos} Σ_{neg} I(Ppos, Pneg)

In the formula, numpos and numneg are the numbers of positive and negative samples, respectively, and I(Ppos, Pneg) is:

I(Ppos, Pneg) = 1 if Ppos > Pneg; 0.5 if Ppos = Pneg; 0 if Ppos < Pneg

In the formula, Ppos is the probability predicted by the classifier that a sample is a positive example, and Pneg is the probability predicted by the classifier that a sample is a negative example.
A higher AUC value indicates a better classification effect, i.e. a better transformation matrix Q.
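The pairwise AUC definition above can be implemented directly; a minimal sketch (O(numpos · numneg), fine for evaluation-sized sets; the function name is illustrative):

```python
import numpy as np

def auc(scores, labels):
    """Pairwise AUC as in the claim: average of I(P_pos, P_neg)
    over all (positive, negative) sample pairs. labels are +1 / -1."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    total = 0.0
    for p in pos:
        for q in neg:
            total += 1.0 if p > q else (0.5 if p == q else 0.0)  # I(P_pos, P_neg)
    return total / (len(pos) * len(neg))
```

An AUC of 1.0 means every positive sample is scored above every negative sample; 0.5 corresponds to chance-level ranking.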
8. The medical dataset feature dimension reduction method based on subspace learning according to claim 7, characterized in that the classifier is a KNN classifier.
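For reference, a plain k-nearest-neighbour classifier of the kind claim 8 refers to can be sketched in a few lines (majority vote over Euclidean neighbours; the function name and tie-breaking behaviour are assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """k-nearest-neighbour prediction by majority vote, one sample per row."""
    preds = []
    for x in X_test:
        d2 = ((X_train - x) ** 2).sum(axis=1)          # squared distances to x
        nearest = y_train[np.argsort(d2)[:k]]          # labels of the k nearest neighbours
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])          # majority label
    return np.array(preds)
```

In the method's pipeline, the rows fed to such a classifier would be the dimension-reduced samples Q^T X rather than the raw high-dimensional data.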
CN201910546805.9A 2019-06-24 2019-06-24 Medical data collection feature dimension reduction method based on sub-space learning Pending CN110364264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910546805.9A CN110364264A (en) 2019-06-24 2019-06-24 Medical data collection feature dimension reduction method based on sub-space learning


Publications (1)

Publication Number Publication Date
CN110364264A true CN110364264A (en) 2019-10-22

Family

ID=68216762



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401471A (en) * 2020-04-08 2020-07-10 中国人民解放军国防科技大学 Spacecraft attitude anomaly detection method and system
CN111401471B (en) * 2020-04-08 2023-04-18 中国人民解放军国防科技大学 Spacecraft attitude anomaly detection method and system
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN114677550A (en) * 2022-02-25 2022-06-28 西北工业大学 Rapid image pixel screening method based on sparse discriminant K-means
CN114677550B (en) * 2022-02-25 2024-02-27 西北工业大学 Rapid image pixel screening method based on sparse discrimination K-means
CN114897796A (en) * 2022-04-22 2022-08-12 深圳市铱硙医疗科技有限公司 Method, device, equipment and medium for judging stability of atherosclerotic plaque


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022