CN110364264A - Medical data collection feature dimension reduction method based on sub-space learning - Google Patents

Medical data collection feature dimension reduction method based on sub-space learning

Info

Publication number
CN110364264A
Authority
CN
China
Prior art keywords
matrix
sample
class
formula
high dimensional
Prior art date
Legal status
Pending
Application number
CN201910546805.9A
Other languages
Chinese (zh)
Inventor
庾安妮
徐雷
Current Assignee
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date
Filing date
Publication date
Application filed by Nanjing Tech University
Priority to CN201910546805.9A
Publication of CN110364264A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients


Abstract

The invention discloses a feature dimensionality reduction method for medical data sets based on subspace learning. The method comprises the following steps: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed; construct the optimization objective function and solve its Lagrangian function; compute the global discriminant information and the local discriminant information from the original high-dimensional data matrix and the label column; iteratively solve for the transformation matrix Q until the objective function converges or the maximum number of iterations is reached, yielding the dimensionality-reduced data matrix; train a model with the obtained transformation matrix and evaluate the reduced matrix and the classification accuracy by the AUC value. Compared with existing feature dimensionality reduction methods for medical data sets, the method of the invention exploits the local and the global discriminant information of the data simultaneously, so it is applicable not only to feature reduction problems of ordinary scale but also retains high classification accuracy when the feature dimension of the data far exceeds the number of samples.

Description

Medical data collection feature dimension reduction method based on sub-space learning
Technical field
The invention belongs to the field of big data technology and machine learning, and in particular relates to a feature dimensionality reduction method for medical data sets based on subspace learning.
Background technique
Feature dimensionality reduction (Dimensionality Reduction) aims to convert high-dimensional data into low-dimensional data. The technique arose because machine learning problems in practical application scenarios produce large amounts of complex high-dimensional data. The running time of most data analysis tasks grows at least linearly with the data dimension, and storing and analyzing high-dimensional data consumes large amounts of computer memory and computation time. Moreover, many data mining and machine learning tasks, such as classification, clustering and regression, are only effective in low-dimensional spaces and become extremely difficult in high-dimensional spaces. How to reduce the dimensionality of high-dimensional data without losing important information is therefore a pressing problem.
Broadly, according to whether label information is available, feature dimensionality reduction methods can be roughly divided into supervised, semi-supervised and unsupervised categories. Subspace learning (Subspace Learning) is a class of linear feature dimensionality reduction methods, which assume that the "intrinsic dimensionality" (Intrinsic Dimensionality) of the data can be represented by a linear transformation of the feature vectors. Typical methods of this kind include principal component analysis (Principal Component Analysis, PCA), linear discriminant analysis (Linear Discriminant Analysis, LDA) and locality preserving projections (Locality Preserving Projection, LPP).
However, the existing subspace-learning methods all have drawbacks. PCA linearly maps the original high-dimensional data to low-dimensional data by maximizing the covariance of the data; LDA usually takes the form of a trace ratio, solving for the feature representation by simultaneously maximizing the between-class scatter matrix and minimizing the within-class scatter matrix. Both PCA and LDA perform eigenvalue decomposition by spectral methods and only consider discriminant information from a global viewpoint, such as the variance in PCA and the class means in LDA, while ignoring the discriminant information provided by the neighborhood of each sample point. When the sample size is much smaller than the feature dimension, LDA may encounter singular matrices, leading to inaccurate eigenvectors. LPP, conversely, constructs an adjacency graph of the sample points, computes weights, and preserves the linear structure of the neighborhoods, but it does not account for the importance of the global discriminant information, so its classification performance is limited. In addition, some existing feature dimensionality reduction methods, such as PCA and LDA, are parameter-free and assume that the sample points follow a Gaussian distribution, making them very sensitive to outliers.
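As a point of reference for the spectral methods discussed above, the following is a minimal PCA sketch in NumPy (illustrative only, not the invention's method; the data are synthetic): it projects data onto the top eigenvectors of the covariance matrix, i.e. a purely global dimensionality reduction of the kind the text contrasts with LPP.

```python
import numpy as np

# Minimal PCA sketch: project data onto the leading eigenvectors of the
# sample covariance matrix (a global, spectral subspace-learning step).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))                 # 50 samples, 8 features
Xc = X - X.mean(axis=0)                      # center the data
C = Xc.T @ Xc / (X.shape[0] - 1)             # 8 x 8 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
A = eigvecs[:, ::-1][:, :3]                  # top-3 principal directions
Y = Xc @ A                                   # 50 x 3 low-dimensional data

assert Y.shape == (50, 3)
```

Because the projection directions are eigenvectors of the covariance matrix, the projected coordinates are uncorrelated; this is the "global variance only" behavior that the proposed method supplements with local neighborhood information.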
At present, the problem of reducing the dimensionality of high-dimensional data is modeled as an optimization problem whose solution often involves eigenvalue decomposition, but it has been pointed out in the literature that the optimal solutions of certain problems cannot be obtained by eigenvalue decomposition. The alternating direction method of multipliers (Alternating Direction Method of Multipliers, ADMM) is suitable for solving convex optimization problems, is computationally efficient and converges quickly, and is a current research focus.
Summary of the invention
The purpose of the present invention is to provide a feature dimensionality reduction method that is simple, efficient, fast to converge and highly accurate in classification.
The technical solution realizing the aim of the invention is as follows: a feature dimensionality reduction method for medical data sets based on subspace learning, comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Compared with the prior art, the remarkable advantages of the present invention are: 1) global and local discriminant information are jointly used for dimensionality reduction, so high classification accuracy is still achieved when the feature dimension far exceeds the number of samples; 2) the l2,1-norm regularization term makes the features sparse, which facilitates feature selection, and the trained model is robust and not easily disturbed by outliers; 3) unlike many parameter-free methods, the proposed method has adjustable parameters, so the trained model can adapt to specific tasks, and experiments show that suitable parameters can be selected by a simple procedure; 4) the proposed solving method for the optimization problem is very efficient, and experiments show that the model converges quickly; 5) the classification accuracy is high.
The present invention is described in further detail below with reference to the accompanying drawings.
Detailed description of the invention
Fig. 1 is the flow chart of the medical data set feature dimensionality reduction method of the present invention based on subspace learning.
Fig. 2 is the convergence curve graph in the embodiment of the present invention.
Fig. 3 is the parameter selection figure in the embodiment of the present invention.
Specific embodiment
With reference to Fig. 1, the medical data set feature dimensionality reduction method of the invention based on subspace learning comprises the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Further, in step 1, the original high-dimensional data matrix and the label column are constructed from the medical data set to be analyzed, specifically:
Construct the original matrix M from the data set, where n is the total number of samples in the medical data set and m is the original feature dimension. The first column of M is the label column, denoted by a vector; the remainder of M after the first column is the data matrix, denoted by the matrix X. The i-th row of the data matrix gives the observed values of the i-th sample under all features, and the j-th column gives all observed values of the j-th feature.
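The construction in step 1 can be sketched as follows; this is a minimal NumPy illustration on synthetic data, and the variable names are assumptions, not the patent's.

```python
import numpy as np

# Hypothetical data set M: first column is the label column (+1 / -1),
# remaining columns are the m feature observations of n samples.
rng = np.random.default_rng(0)
n, m = 6, 4                                  # n samples, m features
labels = rng.choice([-1, 1], size=(n, 1))    # label column
features = rng.normal(size=(n, m))           # feature observations
M = np.hstack([labels, features])            # n x (m + 1) matrix M

y = M[:, 0]        # label column vector
X = M[:, 1:]       # data matrix: row i = sample i, column j = feature j

assert X.shape == (n, m) and y.shape == (n,)
```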
Further, in step 2, the optimization objective function is constructed, its Lagrangian function is derived, and the parameters and variables of the Lagrangian function are initialized, specifically:
In general, linear feature dimensionality reduction is modeled as the following optimization problem: for an original high-dimensional data set, the goal of feature dimensionality reduction is to find a transformation matrix A that maps the original high-dimensional data to new low-dimensional data, where y_i = A^T x_i.
Intuitively, if data of different classes are more dispersed and data of the same class are more aggregated, a classifier can better distinguish the classes of the data. The between-class dispersion and the within-class aggregation are measured by the between-class scatter matrix and the within-class scatter matrix, respectively. The present invention starts from both global and local discriminant information: on the one hand it minimizes the trace of the within-class scatter matrix and maximizes the trace of the between-class scatter matrix; on the other hand it uses Laplacian matrices to keep the local distribution of the converted low-dimensional data consistent with that of the original high-dimensional data.
Step 2-1: (1) For the global discriminant information, the target is to minimize the within-class distance while maximizing the between-class distance; the corresponding objective term is:

min_Q tr(Q^T S_w Q) - α tr(Q^T S_b Q)

(2) For the local discriminant information, the target is to keep the local manifold structure of the converted low-dimensional data consistent with that of the original high-dimensional data; the primal objective constructed is:

min Σ_{i,j} ‖y_i - y_j‖^2 (W_w - W_b)_{ij}

Substituting y_i = Q^T x_i, the objective reduces to:

min_Q tr(Q^T X (L_w - L_b) X^T Q)

Combining (1) and (2) and adding the regularization terms and constraints, the optimization objective function is obtained:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

In the formula, X is the original high-dimensional data matrix, where n is the total number of samples and m is the original feature dimension; the i-th row of the high-dimensional data matrix gives the observed values of the i-th sample under all features, and the j-th column all observed values of the j-th feature. P and Q are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced low dimension; tr(*) is the trace of *; ‖*‖_{2,1} is the l2,1-norm of *, with ‖Q‖_{2,1} compensating the transformation error; ‖*‖_1 is the l1-norm of *; E is the random error matrix and I is the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples, S_w the within-class scatter matrix, L_b the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w the Laplacian matrix corresponding to the within-class adjacency graph G_w. The first part separated by "+" provides the global discriminant information and the second part the local discriminant information;
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ is the penalty parameter, ρ > 0, and Y is the Lagrange multiplier;
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(*) returns the principal component coefficients of the matrix * and P = PCA(X) initializes an orthogonal matrix P.
Further, in step 3, the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for S_b and S_w are respectively:

S_b = Σ_{i=1}^{c} n_i (μ_i - μ)(μ_i - μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k - μ_i)(x_k - μ_i)^T

In the formulas, μ = (1/n) Σ_{j=1}^{n} x_j is the mean vector of all samples, where n is the total number of samples in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, one class per label; x_j is the j-th sample, i.e. the j-th row of data of the original high-dimensional data matrix; c_i is the i-th class, and x_j ∈ c_i is true if and only if the label of the j-th sample matches the label of the i-th class; n_i is the total number of samples of the i-th class, samples with identical labels belonging to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class, and x_k ∈ X_i is true if and only if the label of the k-th sample matches the label of the i-th class; μ_i = (1/n_i) Σ_{x_k ∈ X_i} x_k is the mean vector of the i-th class.
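The two scatter matrices above follow the standard LDA definitions and can be sketched as follows; this is an illustration on hypothetical two-class data, not the patent's code.

```python
import numpy as np

# Between-class scatter S_b and within-class scatter S_w on a synthetic
# two-class sample (samples as rows of X, labels in y).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (10, 5)),
               rng.normal(3.0, 1.0, (12, 5))])   # 22 samples, 5 features
y = np.array([0] * 10 + [1] * 12)

mu = X.mean(axis=0)                              # mean vector of all samples
Sb = np.zeros((5, 5))
Sw = np.zeros((5, 5))
for c in np.unique(y):
    Xc = X[y == c]                               # samples of class c
    mu_c = Xc.mean(axis=0)                       # class mean vector
    d = (mu_c - mu)[:, None]
    Sb += len(Xc) * (d @ d.T)                    # n_c (mu_c - mu)(mu_c - mu)^T
    Sw += (Xc - mu_c).T @ (Xc - mu_c)            # within-class outer products

# Sanity check: S_b + S_w equals the total scatter S_t.
St = (X - mu).T @ (X - mu)
assert np.allclose(Sb + Sw, St)
```

The final assertion verifies the classical decomposition S_t = S_b + S_w, a quick way to validate an implementation of these formulas.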
Further, in step 4, the Laplacian matrices L_b and L_w corresponding to the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples are computed from the original high-dimensional data matrix and the label column, specifically:
The formulas for L_w and L_b are respectively:

L_w = D_w - W_w
L_b = D_b - W_b

In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of W, i.e. D_ii = Σ_j W_ij, where W_ij is the element in row i, column j of the adjacency-graph weight matrix; W_w and W_b are the weight matrices of the within-class and between-class adjacency graphs respectively:

W_w,ij = 1 if x_j ∈ knn_w(x_i) or x_i ∈ knn_w(x_j), and 0 otherwise
W_b,ij = 1 if x_j ∈ knn_b(x_i) or x_i ∈ knn_b(x_j), and 0 otherwise

In the formulas, knn(*) denotes the set of k nearest neighbors of the sample point *, where k is a user-defined positive integer; knn(*) is further split into knn_w(*), the set of neighbors with the same label as the sample point *, and knn_b(*), the set of neighbors with a different label from *.
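The label-aware neighbor graphs described above can be sketched as follows; a minimal NumPy illustration on synthetic data (the function name and the symmetrization convention are assumptions).

```python
import numpy as np

# W_w connects each sample to its k nearest neighbors with the SAME label,
# W_b to its k nearest neighbors with a DIFFERENT label; L = D - W for both.
def knn_graphs(X, y, k=2):
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(dist[i, same])][:k]:
            Ww[i, j] = Ww[j, i] = 1.0            # symmetrize the graph
        for j in diff[np.argsort(dist[i, diff])][:k]:
            Wb[i, j] = Wb[j, i] = 1.0
    Lw = np.diag(Ww.sum(axis=1)) - Ww            # L_w = D_w - W_w
    Lb = np.diag(Wb.sum(axis=1)) - Wb            # L_b = D_b - W_b
    return Lw, Lb

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 4))
y = np.array([0] * 6 + [1] * 6)
Lw, Lb = knn_graphs(X, y, k=2)
# A graph Laplacian is symmetric with zero row sums.
assert np.allclose(Lw, Lw.T) and np.allclose(Lw.sum(axis=1), 0)
```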
Further, in step 5, using the S_b, S_w, G_w and G_b obtained above, the Lagrangian function is iterated to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached, specifically:
To solve the Lagrangian function, let r = X - P Q^T X - E be the residual; completing the square, the Lagrangian function L_ρ(P, Q, E, Y) can be simplified to:

L_ρ = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + (ρ/2) ‖X - P Q^T X - E + Y/ρ‖_F^2 - (1/(2ρ)) ‖Y‖_F^2

In the formula, let R = X - E + Y/ρ; the final simplified Lagrangian is then:

L_ρ = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + (ρ/2) ‖R - P Q^T X‖_F^2 - (1/(2ρ)) ‖Y‖_F^2
(1) The iterative update formula for Q is derived as follows:
Fixing P and E and setting the partial derivative ∂L_ρ/∂Q = 0 yields the update formula for Q:

Q = (2(S_w - α S_b + β X (L_w - L_b) X^T) + λ1 U + ρ X X^T)^{-1} (ρ X R^T P)

In the formula, U is a diagonal matrix whose i-th diagonal element is:

U_ii = 1 / (2 ‖q_i‖_2)

where q_i is the i-th row of data of the matrix Q;
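The diagonal reweighting matrix U comes from the standard iterative treatment of the l2,1-norm term. A minimal sketch (the epsilon guard against zero rows is a common practical safeguard, not stated in the source):

```python
import numpy as np

# U_ii = 1 / (2 * ||q_i||_2), with q_i the i-th row of Q.
def l21_weight_matrix(Q, eps=1e-12):
    row_norms = np.linalg.norm(Q, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

Q = np.array([[3.0, 4.0],      # row norm 5  -> U_00 = 0.1
              [0.0, 2.0]])     # row norm 2  -> U_11 = 0.25
U = l21_weight_matrix(Q)
assert np.allclose(np.diag(U), [0.1, 0.25])
# With this U, tr(Q^T U Q) = 0.5 * ||Q||_{2,1}: the identity behind the update.
assert np.isclose(np.trace(Q.T @ U @ Q), 0.5 * np.linalg.norm(Q, axis=1).sum())
```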
(2) The iterative update formula for P is derived as follows:
Fixing Q and E and cancelling the terms constant in P, P is obtained from:

min_P ‖R - P Q^T X‖_F^2  s.t. P P^T = I

Computing the SVD of R X^T Q as R X^T Q = U S V^T, the update formula for P is:

P = U V^T
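This P-update is an orthogonal Procrustes problem, solved in closed form by an SVD. A minimal sketch on random matrices (the shapes are hypothetical):

```python
import numpy as np

# P-update: with Q, X, E, Y fixed, minimizing ||R - P Q^T X||_F^2 over
# orthogonal P is solved by the SVD  R X^T Q = U S V^T,  P = U V^T.
rng = np.random.default_rng(4)
m, n, d = 6, 20, 3
R = rng.normal(size=(m, n))                  # stands in for R = X - E + Y/rho
XtQ = rng.normal(size=(n, d))                # stands in for X^T Q
U, s, Vt = np.linalg.svd(R @ XtQ, full_matrices=False)
P = U @ Vt                                   # m x d, columns orthonormal

assert P.shape == (m, d)
assert np.allclose(P.T @ P, np.eye(d), atol=1e-10)
```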
(3) The iterative update formula for E is derived as follows:
Fixing P and Q, let E_0 = X - P Q^T X + Y/ρ and e = λ2/ρ; the update formula for E is then:

E = shrink(E_0, e)

In the formula, shrink denotes the soft-thresholding (shrinkage) operator, specifically:

shrink(E_0, e) = sign(E_0) max(|E_0| - e, 0)
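The shrinkage operator above is element-wise soft-thresholding and admits a one-line NumPy sketch (illustrative values):

```python
import numpy as np

# shrink(E0, e) = sign(E0) * max(|E0| - e, 0), applied element-wise.
def shrink(E0, e):
    return np.sign(E0) * np.maximum(np.abs(E0) - e, 0.0)

E0 = np.array([[ 1.5, -0.2],
               [-3.0,  0.4]])
E = shrink(E0, 0.5)
# Entries with |value| <= 0.5 are zeroed; the rest move 0.5 towards zero.
assert np.allclose(E, [[1.0, 0.0], [-2.5, 0.0]])
```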
(4) The iterative update formulas for Y and ρ are:

Y = Y + ρ (X - P Q^T X - E)
ρ = min(ρ_max, μρ)

In the formulas, ρ_max and μ are predefined constants.
Further, in step 6, a classifier is trained with the transformation matrix Q, and the matrix Q^T X is then evaluated by the AUC value of the classifier, specifically:
In the medical data set corresponding to the matrix Q^T X, the sample label of a positive example is denoted by +1 and the sample label of a negative example by -1;
The AUC value of the classifier is computed as:

AUC = (1 / (num_pos · num_neg)) Σ_{pos} Σ_{neg} I(P_pos, P_neg)

In the formula, num_pos and num_neg are the numbers of positive and negative samples respectively, and I(P_pos, P_neg) is:

I(P_pos, P_neg) = 1 if P_pos > P_neg, 0.5 if P_pos = P_neg, and 0 if P_pos < P_neg

In the formula, P_pos is the probability with which the classifier predicts a sample to be a positive example, and P_neg the probability with which it predicts a sample to be a negative example;
The higher the computed AUC value, the better the classification effect, i.e. the better the obtained transformation matrix Q.
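The pairwise AUC above can be sketched directly from its definition; a minimal NumPy illustration on hypothetical scores (the function name is an assumption).

```python
import numpy as np

# Pairwise AUC: compare every (positive, negative) score pair, counting 1
# for a correctly ordered pair, 0.5 for a tie, 0 otherwise, then average.
def pairwise_auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.4])
labels = np.array([ 1,   1,  -1,  -1,   1])
# Pairs: (0.9,0.4),(0.9,0.3),(0.8,0.4),(0.8,0.3),(0.4,0.3) ordered correctly;
# (0.4,0.4) ties  ->  (5 + 0.5) / 6
assert np.isclose(pairwise_auc(scores, labels), 5.5 / 6)
```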
Preferably, the classifier in step 6 is a KNN classifier.
The present invention is described in further detail below with reference to an embodiment.
Embodiment
The medical data set feature dimensionality reduction method of the invention based on subspace learning includes the following contents:
1. Construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed.
The data set used in this embodiment is the ARCENE data set, derived from human serum mass spectra. The ARCENE data set has 900 samples, and its feature dimension is as high as 10000. The task is a binary classification problem aiming to distinguish people with cancer (label +1) from normal people (label -1). The whole data set was merged from two prostate cancer data sets and one ovarian cancer data set, collected by the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). The data have no missing values, and about 44% of the samples are positive examples. The data set consists of three parts: a training set of 100 samples, a validation set of 100 samples and a test set of 700 samples. The reduced dimension d is set to 10 by default.
2. Construct the optimization objective function and its Lagrangian function, and initialize the variables and parameters of the algorithm.
The objective function used in this embodiment is as follows:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

The Lagrangian function of the above formula is:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ (ρ > 0) is the penalty parameter and Y is the Lagrange multiplier.
The initial assignments of the variables and parameters are shown in Table 1 below:
Table 1. Variable and parameter initialization
3. From the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w, obtaining the global discriminant information.
The formulas are respectively:

S_b = Σ_{i=1}^{c} n_i (μ_i - μ)(μ_i - μ)^T
S_w = Σ_{i=1}^{c} Σ_{x_k ∈ X_i} (x_k - μ_i)(x_k - μ_i)^T

In the formulas, μ is the mean vector of all samples, where n is the total number of samples in the original high-dimensional data matrix; c is the number of classes, one class per label; x_j is the j-th sample, i.e. the j-th row of data of the original high-dimensional data matrix; c_i is the i-th class, and x_j ∈ c_i is true if and only if the label of the j-th sample matches the label of the i-th class; n_i is the total number of samples of the i-th class, samples with identical labels belonging to the same class; x_k is the k-th sample; X_i is the set of samples belonging to the i-th class, and x_k ∈ X_i is true if and only if the label of the k-th sample matches the label of the i-th class; μ_i is the mean vector of the i-th class.
The S_b and S_w obtained in this embodiment are respectively:
4. From the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information.
The L_b and L_w obtained in this embodiment are sparse matrices, respectively:
5. Iteratively compute the transformation matrices Q and P, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the maximum number of iterations is reached.
(1) Q is obtained from the following formula:

Q = (2(S_w - α S_b + β X (L_w - L_b) X^T) + λ1 U + ρ X X^T)^{-1} (ρ X R^T P)

The Q obtained in this embodiment is:
(2) P is obtained from the following formula:

P = U V^T

The P obtained in this embodiment is:
(3) E is obtained from the following formula:

E = shrink(E_0, e)

The E obtained in this embodiment is:
(4) Y is obtained from the following formula:

Y = Y + ρ (X - P Q^T X - E)

The Y obtained in this embodiment is:
6. Train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
Convergence:
As shown in Fig. 2, the convergence curve obtained in this embodiment has the number of iterations on the abscissa (the maximum number of iterations is set to 100), the classification accuracy (expressed as a percentage) on the left ordinate, and the value of the objective function on the right ordinate. The experimental results of the embodiment show that the model converges in no more than 30 iterations.
Parameter selection:
This embodiment has two preset parameters λ1 and λ2, whose settings affect the convergence of the established optimization model. The parameter selection method is as follows:
This embodiment selects different pairwise combinations of λ1 and λ2 from the candidate set {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4, 10^5} to train the model, recording the AUC value under each combination; the results are shown in Fig. 3. The experimental results of the embodiment show that, for the ARCENE data set, the classification accuracy of the model is highest when λ1 and λ2 take 0.01 and 0.1 respectively. The optimal values of λ1 and λ2 differ across data sets, but experiments show that the parameters can be selected by the control-variable method: fixing the value of one parameter, the interval of values of the other parameter for which the algorithm performs best is easy to obtain.
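The parameter sweep described above can be sketched as a simple grid search; `eval_model` below is a hypothetical stand-in for "train the model with (λ1, λ2) and return its AUC", here replaced by a toy surrogate for illustration only.

```python
import numpy as np

# Sweep lambda1 and lambda2 over the log-spaced candidate set, score each
# pair, and keep the best-scoring combination.
candidates = [10.0 ** p for p in range(-5, 6)]   # 1e-5 ... 1e5

def eval_model(lam1, lam2):
    # Toy surrogate score peaking at lam1 = 1e-2, lam2 = 1e-1 (illustration
    # only; a real run would train the model and return the validation AUC).
    return -((np.log10(lam1) + 2) ** 2 + (np.log10(lam2) + 1) ** 2)

best = max(((l1, l2) for l1 in candidates for l2 in candidates),
           key=lambda pair: eval_model(*pair))
assert best == (candidates[3], candidates[4])    # lambda1 = 1e-2, lambda2 = 1e-1
```

The control-variable refinement mentioned in the text corresponds to fixing one coordinate of `best` and re-sweeping only the other, which reduces the cost from quadratic to linear in the candidate-set size.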
The present invention realizes feature dimensionality reduction of medical data sets: by computing the global discriminant information and the local discriminant information and iteratively solving for the transformation matrix according to the proposed optimization objective function, an optimal low-dimensional linear representation of the original high-dimensional data is reached. Experiments show that, compared with existing feature dimensionality reduction methods for medical data sets, the method of the invention is not only applicable to feature reduction problems of ordinary scale but also retains high classification accuracy when the feature dimension of the data far exceeds the number of samples. In addition, the parameters of the invention are adjustable, so the trained model can adapt to specific tasks; the computation method is very efficient; and it is insensitive to outliers.

Claims (8)

1. A feature dimensionality reduction method for medical data sets based on subspace learning, characterized by comprising the following steps:
Step 1: construct the original high-dimensional data matrix X and the label column from the medical data set to be analyzed;
Step 2: construct the optimization objective function, derive its Lagrangian function, and initialize the parameters and variables of the Lagrangian function;
Step 3: from the original high-dimensional data matrix and the label column, compute the between-class scatter matrix S_b and the within-class scatter matrix S_w of the samples in the medical data set, obtaining the global discriminant information;
Step 4: from the original high-dimensional data matrix and the label column, construct the between-class adjacency graph G_b and the within-class adjacency graph G_w of the samples in the medical data set, and compute the corresponding Laplacian matrices L_b and L_w, obtaining the local discriminant information;
Step 5: using the S_b, S_w, G_w and G_b obtained above, iterate the Lagrangian function to solve for the transformation matrices Q and P that reconstruct the original high-dimensional data matrix, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached; the final matrix Q^T X is the dimensionality-reduced data matrix;
Step 6: train a classifier with the transformation matrix Q, then evaluate the matrix Q^T X by the AUC value of the classifier.
2. The medical data set feature dimensionality reduction method based on subspace learning according to claim 1, characterized in that in step 1 the original high-dimensional data matrix and the label column are constructed from the medical data set to be analyzed, specifically:
Construct the original matrix M, where n is the total number of samples in the medical data set and m is the original feature dimension; the first column of M is the label column, denoted by a vector; the remainder of M after the first column is the data matrix, denoted by a matrix; the i-th row of the data matrix gives the observed values of the i-th sample under all features, and the j-th column gives all observed values of the j-th feature.
3. The medical data set feature dimensionality reduction method based on subspace learning according to claim 1 or 2, characterized in that in step 2 the optimization objective function is constructed, its Lagrangian function is derived, and the parameters and variables of the Lagrangian function are initialized, specifically:
Step 2-1: construct the optimization objective function as follows:

min_{P,Q,E} tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1
s.t. X = P Q^T X + E, P P^T = I

In the formula, X is the original high-dimensional data matrix, where n is the total number of samples it contains and m is the original feature dimension; the i-th row of the high-dimensional data matrix gives the observed values of the i-th sample under all features, and the j-th column all observed values of the j-th feature; P and Q are transformation matrices used to reconstruct the original high-dimensional data matrix, where d is the reduced low dimension; tr(*) is the trace of *; ‖*‖_{2,1} is the l2,1-norm of *, with ‖Q‖_{2,1} compensating the transformation error; ‖*‖_1 is the l1-norm of *; E is the random error matrix and I the identity matrix; the parameters α, β, λ1 and λ2 are positive real numbers; S_b is the between-class scatter matrix of the samples, S_w the within-class scatter matrix, L_b the Laplacian matrix corresponding to the between-class adjacency graph G_b of the samples, and L_w the Laplacian matrix corresponding to the within-class adjacency graph G_w; the first part separated by "+" provides the global discriminant information and the second part the local discriminant information;
Step 2-2: derive the Lagrangian function L_ρ(P, Q, E, Y) of the optimization objective function:

L_ρ(P, Q, E, Y) = tr(Q^T (S_w - α S_b) Q) + β tr(Q^T X (L_w - L_b) X^T Q) + λ1 ‖Q‖_{2,1} + λ2 ‖E‖_1 + ⟨Y, X - P Q^T X - E⟩ + (ρ/2) ‖X - P Q^T X - E‖_F^2

In the formula, ρ is the penalty parameter, ρ > 0, and Y is the Lagrange multiplier;
Step 2-3: initialize the parameters and variables of the Lagrangian function: α = 0.01, β = 0.02, λ1 = 0.01, λ2 ∈ {1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5}, ρ = 0.1, μ = 0.1, ρ_max = 10^5; iter is the number of iterations; Q = 0, E = 0, Y = 0, P = PCA(X), where PCA(*) returns the principal component coefficients of the matrix * and P = PCA(X) initializes an orthogonal matrix P.
4. The medical dataset feature dimension reduction method based on subspace learning according to claim 3, characterized in that step 3 computes, from the original high-dimensional data matrix and the label column, the between-class scatter matrix Sb and the within-class scatter matrix Sw of the samples in the medical dataset, specifically:
The formulas for Sb and Sw are respectively:

Sb = Σ_{i=1}^{c} ni (μi − μ)(μi − μ)^T

Sw = Σ_{i=1}^{c} Σ_{xk ∈ Xi} (xk − μi)(xk − μi)^T

In the formulas, μ = (1/n) Σ_{j=1}^{n} xj is the mean vector of all samples, where n is the number of samples in the original high-dimensional data matrix; c is the number of classes of the original high-dimensional data matrix, one class per distinct label; xj is the j-th sample, i.e. the j-th row of the original high-dimensional data matrix; ci is the i-th class, and xj ∈ ci is true if and only if the label of the j-th sample equals the label of the i-th class; ni is the number of samples of the i-th class, samples with the same label belonging to the same class; xk is the k-th sample; Xi is the set of samples belonging to the i-th class, and xk ∈ Xi is true if and only if the label of the k-th sample equals the label of the i-th class; μi = (1/ni) Σ_{xk ∈ Xi} xk is the mean vector of the samples of the i-th class.
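The two scatter matrices can be computed directly from the data and labels; a minimal NumPy sketch, again taking one sample per column (`scatter_matrices` is an illustrative name):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (Sb) and within-class (Sw) scatter matrices.
    X: m x n, one sample per column; y: length-n label array."""
    mu = X.mean(axis=1, keepdims=True)            # mean vector of all samples
    Sb = np.zeros((X.shape[0], X.shape[0]))
    Sw = np.zeros_like(Sb)
    for c in np.unique(y):
        Xi = X[:, y == c]                         # samples of class c
        ni = Xi.shape[1]
        mui = Xi.mean(axis=1, keepdims=True)      # class mean vector
        Sb += ni * (mui - mu) @ (mui - mu).T
        Sw += (Xi - mui) @ (Xi - mui).T
    return Sb, Sw
```

A useful sanity check is the classical identity Sb + Sw = St, where St is the total scatter matrix of the centered data.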
5. The medical dataset feature dimension reduction method based on subspace learning according to claim 4, characterized in that step 4 computes, from the original high-dimensional data matrix and the label column, the Laplacian matrices Lb and Lw corresponding to the between-class adjacency graph Gb and the within-class adjacency graph Gw of the samples, specifically:
The formulas for Lw and Lb are respectively:

Lw = Dw − Ww

Lb = Db − Wb

In the formulas, D is a diagonal matrix whose diagonal entries are the row sums of the corresponding W, i.e. Dii = Σj Wij, where Wij is the element in row i, column j of the adjacency-graph weight matrix; Ww and Wb are the weight matrices of the within-class and between-class adjacency graphs, respectively:

(Ww)ij = 1 if xi ∈ knn_w(xj) or xj ∈ knn_w(xi), and 0 otherwise

(Wb)ij = 1 if xi ∈ knn_b(xj) or xj ∈ knn_b(xi), and 0 otherwise

In the formulas, knn(·) is the k-nearest-neighbour set of a sample point, with k a user-defined positive integer; knn(·) is further segmented into knn_w(·), the set of neighbours sharing the sample point's label, and knn_b(·), the set of neighbours with a different label.
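The graph construction above can be sketched as follows. The claim segments knn(·) into same-label and different-label neighbour sets; this sketch approximates that by taking the overall k nearest neighbours of each point and splitting them by label (function name and the symmetrization choice are assumptions):

```python
import numpy as np

def class_aware_laplacians(X, y, k):
    """Within-class (Lw) and between-class (Lb) graph Laplacians.
    X: m x n, one sample per column; y: length-n labels; k: neighbour count."""
    n = X.shape[1]
    D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    Ww = np.zeros((n, n)); Wb = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D2[i])
        knn = [j for j in order if j != i][:k]       # k nearest neighbours of sample i
        for j in knn:
            if y[i] == y[j]:
                Ww[i, j] = Ww[j, i] = 1.0            # same-label neighbour -> within-class edge
            else:
                Wb[i, j] = Wb[j, i] = 1.0            # different-label neighbour -> between-class edge
    Lw = np.diag(Ww.sum(axis=1)) - Ww                # L = D - W
    Lb = np.diag(Wb.sum(axis=1)) - Wb
    return Lw, Lb
```

By construction each Laplacian is symmetric with zero row sums, which is a quick way to validate the output.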
6. The medical dataset feature dimension reduction method based on subspace learning according to claim 5, characterized in that step 5, combining the Sb, Sw, Gw and Gb obtained above, iterates the Lagrangian to obtain the transformation matrices Q and P used to reconstruct the original high-dimensional data matrix for dimension reduction, the error matrix E and the Lagrange multiplier Y, until the objective function converges or the preset maximum number of iterations is reached, specifically:
(1) The iterative update of Q:

Let R = X − E + Y/ρ and set the partial derivative ∂L_ρ/∂Q = 0; this yields the update formula for Q:

Q = (2(Sw − αSb + βX(Lw − Lb)X^T) + λ1U + ρXX^T)^{-1}(ρXR^T P)

In the formula, U is a diagonal matrix whose i-th diagonal element is

Uii = 1/(2‖qi‖2)

where qi is the i-th row of the matrix Q.
(2) The iterative update of P:

Let R = X − E + Y/ρ and drop the terms that are constant in P; the subproblem reduces to an orthogonal Procrustes problem. Compute the SVD of RX^T Q as RX^T Q = USV^T; the update formula for P is then:

P = UV^T
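Update (2) is a standard orthogonal Procrustes step; a minimal sketch under the same sample-per-column convention:

```python
import numpy as np

def update_P(X, Q, E, Y, rho):
    """Orthogonal P update via SVD (step 5, update (2)): P = U V^T
    where U S V^T is the SVD of R X^T Q."""
    R = X - E + Y / rho
    U, _, Vt = np.linalg.svd(R @ X.T @ Q, full_matrices=False)
    return U @ Vt                                  # orthonormal columns by construction
```

Because U and V come from an SVD, the updated P always has orthonormal columns, so the orthogonality constraint is preserved at every iteration.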
(3) The iterative update of E:

Let E0 = X − PQ^T X + Y/ρ and e = λ2/ρ; the update formula for E is then:

E = shrink(E0, e)

In the formula, shrink is the shrinkage (soft-thresholding) operator, applied element-wise:

shrink(E0, e) = sign(E0) max(|E0| − e, 0)
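The shrinkage operator is a one-liner in NumPy; the sign and maximum are taken element-wise:

```python
import numpy as np

def shrink(E0, e):
    """Element-wise soft-thresholding: sign(E0) * max(|E0| - e, 0)."""
    return np.sign(E0) * np.maximum(np.abs(E0) - e, 0.0)
```

Entries with magnitude below the threshold e are set exactly to zero, which is what makes E sparse under the l1 penalty.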
(4) The iterative updates of Y and ρ:

Y = Y + ρ(X − PQ^T X − E)

ρ = min(ρmax, μρ)

In the formulas, ρmax and μ are predefined constants.
7. The medical dataset feature dimension reduction method based on subspace learning according to claim 1, characterized in that step 6 trains a classifier with the transformation matrix Q and then evaluates the matrix Q^T X by the AUC value of the classifier, specifically:

In the medical dataset corresponding to the matrix Q^T X, the labels of positive samples are denoted by +1 and the labels of negative samples are denoted by −1.
Compute the AUC value of the classifier:

AUC = (1/(numpos · numneg)) Σ_{pos} Σ_{neg} I(Ppos, Pneg)

In the formula, numpos and numneg are the numbers of positive and negative samples, respectively, and I(Ppos, Pneg) is:

I(Ppos, Pneg) = 1 if Ppos > Pneg; 0.5 if Ppos = Pneg; 0 if Ppos < Pneg

In the formula, Ppos is the probability predicted by the classifier that a sample is a positive example, and Pneg is the probability predicted by the classifier that a sample is a negative example.
A higher AUC value indicates a better classification effect, i.e. a better transformation matrix Q.
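The pairwise AUC definition above can be implemented directly; a minimal sketch (O(numpos · numneg), fine for evaluation-sized sets; the function name is illustrative):

```python
import numpy as np

def auc(scores, labels):
    """Pairwise AUC as in the claim: average of I(P_pos, P_neg)
    over all (positive, negative) sample pairs. labels are +1 / -1."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    total = 0.0
    for p in pos:
        for q in neg:
            total += 1.0 if p > q else (0.5 if p == q else 0.0)  # I(P_pos, P_neg)
    return total / (len(pos) * len(neg))
```

An AUC of 1.0 means every positive sample is scored above every negative sample; 0.5 corresponds to chance-level ranking.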
8. The medical dataset feature dimension reduction method based on subspace learning according to claim 7, characterized in that the classifier is a KNN classifier.
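For reference, a plain k-nearest-neighbour classifier of the kind claim 8 refers to can be sketched in a few lines (majority vote over Euclidean neighbours; the function name and tie-breaking behaviour are assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """k-nearest-neighbour prediction by majority vote, one sample per row."""
    preds = []
    for x in X_test:
        d2 = ((X_train - x) ** 2).sum(axis=1)          # squared distances to x
        nearest = y_train[np.argsort(d2)[:k]]          # labels of the k nearest neighbours
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])          # majority label
    return np.array(preds)
```

In the method's pipeline, the rows fed to such a classifier would be the dimension-reduced samples Q^T X rather than the raw high-dimensional data.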
CN201910546805.9A 2019-06-24 2019-06-24 Medical data collection feature dimension reduction method based on sub-space learning Pending CN110364264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910546805.9A CN110364264A (en) 2019-06-24 2019-06-24 Medical data collection feature dimension reduction method based on sub-space learning


Publications (1)

Publication Number Publication Date
CN110364264A true CN110364264A (en) 2019-10-22

Family

ID=68216762



Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401471A (en) * 2020-04-08 2020-07-10 中国人民解放军国防科技大学 Spacecraft attitude anomaly detection method and system
CN111401471B (en) * 2020-04-08 2023-04-18 中国人民解放军国防科技大学 Spacecraft attitude anomaly detection method and system
CN112132624A (en) * 2020-09-27 2020-12-25 平安医疗健康管理股份有限公司 Medical claims data prediction system
CN114677550A (en) * 2022-02-25 2022-06-28 西北工业大学 Rapid image pixel screening method based on sparse discriminant K-means
CN114677550B (en) * 2022-02-25 2024-02-27 西北工业大学 Rapid image pixel screening method based on sparse discrimination K-means
CN114897796A (en) * 2022-04-22 2022-08-12 深圳市铱硙医疗科技有限公司 Method, device, equipment and medium for judging stability of atherosclerotic plaque


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022