CN113177608B - Neighbor model feature selection method and device for incomplete data - Google Patents
- Publication number
- CN113177608B (application number CN202110559948.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- loss function
- incomplete data
- samples
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a neighbor model feature selection method and device for incomplete data, wherein the method comprises the following steps: step 1, initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector; step 2, constructing a filling loss function of incomplete data based on the given feature weight vector w; step 3, minimizing the loss function of step 2 by an alternating iterative optimization algorithm; step 4, constructing the loss function of the neighbor model feature selection method based on the filled data; step 5, optimizing the loss function of step 4 by a gradient descent method; step 6, repeating steps 3 to 5 until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and step 7, sorting the features in descending order according to the finally output feature weight vector, thereby selecting an optimal feature subset. The method and device take the importance of the features into account when calculating the filling loss, and can therefore effectively improve the classification accuracy of feature selection for incomplete data.
Description
Technical Field
The invention relates to the technical field of feature selection, and in particular to a neighbor model feature selection method and device for incomplete data.
Background
Feature selection is an effective data preprocessing technique that is widely used in fields such as image analysis, text analysis and bioinformatics. In practice, many datasets contain missing values because of machine faults or hardware limitations of the acquisition equipment; that is, high-dimensional incomplete datasets arise in many fields. A high-dimensional incomplete dataset not only increases the time and space costs of a model but also reduces the model's classification accuracy. Although many feature selection algorithms exist today, most are designed for complete data [Tran, C.T., et al., "Improving performance of classification on incomplete data using feature selection and clustering," Applied Soft Computing, 2018, 73: 848-861].
Currently, feature selection and data filling (imputation) are generally treated as two separate processes [Zhao, Y. and Q. Long, "Variable selection in the presence of missing data: imputation-based methods," Wiley Interdisciplinary Reviews: Computational Statistics, 2017, 9(5): e1402]; that is, the incomplete data are first filled based on mean interpolation or some other criterion, and feature selection is then performed on the filled data. Because the importance of the features is not considered during data filling, this kind of approach typically overlooks some important features.
Disclosure of Invention
To address the problem that existing feature selection methods for incomplete data generally treat feature selection and data filling as two independent processes, namely that the importance of the features is not considered during data filling and some important features are therefore ignored, the invention provides a neighbor model feature selection method and device for incomplete data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides a neighbor model feature selection method aiming at incomplete data, which comprises the following steps:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: sorting the features in descending order according to the finally output feature weight vector, so as to select an optimal feature subset.
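Steps 1 to 7 above can be sketched as a single alternating loop. The sketch below is a hypothetical skeleton, not the patented algorithm: to keep it self-contained, the weighted low-rank fill of steps 2-3 is replaced by a plain column-mean fill and the gradient-descent weight update of steps 4-5 by a ReliefF-style proxy, both labeled as stand-ins; only the outer structure (all-ones initialization, alternate fill and update, stop on small change in the length of w, rank features by weight) follows the text.

```python
import numpy as np

def fill_missing(X, w):
    """Stand-in for steps 2-3: a plain column-mean fill.
    The patent instead minimizes a w-weighted low-rank filling loss;
    mean fill is used here only to keep the sketch self-contained."""
    Z = X.copy()
    col_mean = np.nanmean(Z, axis=0)
    idx = np.where(np.isnan(Z))
    Z[idx] = np.take(col_mean, idx[1])
    return Z

def update_weights(Z, y, w):
    """Stand-in for steps 4-5: grow the weight of features that separate
    each sample from its nearest different-class neighbor more than from
    its nearest same-class neighbor (a ReliefF-style proxy)."""
    n, m = Z.shape
    delta = np.zeros(m)
    for p in range(n):
        d = np.sqrt((((Z - Z[p]) * w) ** 2).sum(axis=1))
        d[p] = np.inf                      # never pick the sample itself
        same = np.where(y == y[p])[0]
        diff = np.where(y != y[p])[0]
        hit = same[np.argmin(d[same])]     # nearest same-class sample
        miss = diff[np.argmin(d[diff])]    # nearest different-class sample
        delta += np.abs(Z[p] - Z[miss]) - np.abs(Z[p] - Z[hit])
    return np.clip(w + delta / n, 0.0, None)

def neighbor_feature_selection(X, y, max_iter=20, tol=1e-4):
    """Outer loop of steps 1-7 for an (n, m) matrix X with np.nan
    marking missing entries and label vector y."""
    w = np.ones(X.shape[1])                    # step 1: all-ones vector
    for _ in range(max_iter):                  # step 6: outer loop
        Z = fill_missing(X, w)                 # steps 2-3
        w_new = update_weights(Z, y, w)        # steps 4-5
        done = abs(np.linalg.norm(w_new) - np.linalg.norm(w)) < tol
        w = w_new
        if done:                               # step 6: small length change
            break
    return w, np.argsort(-w)                   # step 7: descending ranking
```

The two stand-in functions can be swapped for the weighted low-rank fill and the gradient-descent update described in the following sections without touching the loop.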
Further, the step 2 includes:
based on the known feature weight vector w, the filling loss function of incomplete data is:
where G and H are low-rank matrices of rank r (the product GH is the feature matrix with the missing components filled in), G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter that needs to be tuned.
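The equation image itself did not survive extraction. From the symbols defined in the surrounding text (w_q, G_p, H_q, Ω, β and the ridge-type closed forms described in step 3), a plausible reconstruction of the filling loss is the following; the exact placement of the weight w_q (squared versus unsquared) is an assumption:

```latex
\min_{G,H}\;\sum_{(p,q)\in\Omega} w_q^{2}\,\bigl(x_{pq} - G_p H_q\bigr)^{2}
\;+\;\beta\left(\lVert G\rVert_F^{2} + \lVert H\rVert_F^{2}\right)
\tag{1}
```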
Further, the step 3 includes:
the optimization problem for the filling loss function of incomplete data can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
it further decomposes into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
it further decomposes into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
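The alternating scheme of step 3 can be sketched in code. The sketch below is a hedged reconstruction, not the patented implementation: it assumes the filling loss has the form sum over (p,q) in Ω of w_q^2 (x_pq - G_p H_q)^2 plus β(||G||_F^2 + ||H||_F^2), and solves each column sub-problem for H and each row sub-problem for G by its ridge-regression closed form, which matches the r × r identity matrix I_r mentioned in the text.

```python
import numpy as np

def weighted_lowrank_fill(X, w, r=2, beta=1e-3, iters=100):
    """Alternating minimization of the assumed loss
    sum_{(p,q) in Omega} w_q^2 (x_pq - G_p H_q)^2 + beta(||G||_F^2 + ||H||_F^2).

    X : (n, m) array with np.nan marking missing entries.
    w : (m,) feature weight vector.
    Returns the filled matrix Z = G H."""
    n, m = X.shape
    obs = ~np.isnan(X)                                 # Omega as a boolean mask
    rng = np.random.default_rng(0)
    G = np.linalg.qr(rng.standard_normal((n, r)))[0]   # G^(0), orthogonal columns
    H = np.zeros((r, m))
    for _ in range(iters):
        # m independent column sub-problems for H (G fixed): ridge regression
        for q in range(m):
            A = G[obs[:, q]]
            wq2 = w[q] ** 2
            H[:, q] = np.linalg.solve(wq2 * A.T @ A + beta * np.eye(r),
                                      wq2 * (A.T @ X[obs[:, q], q]))
        # n independent row sub-problems for G (H fixed): weighted ridge
        for p in range(n):
            B = (H[:, obs[p]] * w[obs[p]]).T           # weight observed columns
            c = X[p, obs[p]] * w[obs[p]]
            G[p] = np.linalg.solve(B.T @ B + beta * np.eye(r), B.T @ c)
    return G @ H
```

In the patent the number of sweeps is governed by a stopping criterion; a fixed `iters` is used here for simplicity.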
Further, the step 4 includes:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature.
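Steps 4.1-4.3 can be sketched numerically. The two formulas below are assumptions filling in the missing equation images: the weighted distance is taken as d_pq(w) = sqrt(sum over l of w_l^2 (z_pl - z_ql)^2), and p_p(w) as the share of kernel mass k(d) = exp(-d/σ) that falls on the different-label neighbor set D_p.

```python
import numpy as np

def weighted_dist(zp, zq, w):
    """Assumed step 4.1: w-weighted Euclidean distance between two samples."""
    return np.sqrt(np.sum(w ** 2 * (zp - zq) ** 2))

def misclassification_prob(Z, y, w, p, k1=2, k2=2, sigma=1.0):
    """Assumed steps 4.2-4.3 for sample p: with kernel k(d) = exp(-d/sigma),
    S_p holds the k1 nearest same-label neighbors and D_p the k2 nearest
    different-label neighbors; p_p(w) is the kernel mass landing on D_p."""
    d = np.array([weighted_dist(Z[p], Z[q], w) for q in range(len(Z))])
    d[p] = np.inf                        # a sample is not its own neighbor
    same = np.where(y == y[p])[0]
    diff = np.where(y != y[p])[0]
    S = same[np.argsort(d[same])][:k1]   # S_p
    D = diff[np.argsort(d[diff])][:k2]   # D_p
    k = np.exp(-d / sigma)
    return k[D].sum() / (k[S].sum() + k[D].sum())
```

Summing p_p(w) over all samples gives the classification error of step 4.4, to which the regularization term of step 4.5 is added.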
Further, the step 5 includes:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
The invention also provides a neighbor model feature selection device for incomplete data, comprising:
an initialization module for initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
a first construction module for constructing a filling loss function of incomplete data based on a given feature weight vector w;
a filling module for minimizing a loss function in the first building module using an alternating iterative optimization algorithm to fill missing components in incomplete data;
the second construction module is used for constructing a loss function of the neighbor model feature selection method based on the filled data;
the optimizing module is used for optimizing the loss function in the second building module by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
the loop module is used for loop executing the filling module, the second construction module and the optimizing module until the length change of the characteristic weight vector is smaller than a threshold value or the maximum iteration number is reached;
and the sorting output module is used for sorting the features in a descending order according to the finally output feature weighting vector so as to select the optimal feature subset.
Compared with the prior art, the invention has the beneficial effects that:
according to the method and the device for selecting the characteristics of the neighbor model aiming at the incomplete data, the characteristic weighting vector obtained by the method for selecting the characteristics of the neighbor model is used for calculating the filling loss, namely, the importance of the characteristics is considered when the filling loss is calculated, so that the classification precision of the characteristic selection aiming at the incomplete data can be effectively improved.
Drawings
Fig. 1 is a basic flowchart of a method for selecting a feature of a neighbor model for incomplete data according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
as shown in fig. 1, a method for selecting a feature of a neighbor model for incomplete data includes:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: sorting the features in descending order according to the finally output feature weight vector, so as to select an optimal feature subset.
As an embodiment, the dataset containing missing values may be expressed as T = (X, y), where y = [y_1, y_2, ..., y_n]^T ∈ R^n is the label vector, X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components, and n is the total number of samples in the dataset. For the feature matrix X, let Ω = {(p, q) | x_pq is observable} be the index set of all observable elements of X and Ω̄ the index set of all missing elements of X, where x_pq denotes the element in row p, column q of X. In addition, let Z ∈ R^(n×m) be the numerical matrix of rank r (r < min(n, m)) obtained after filling X; this matrix can be further decomposed into the product Z = GH of two low-rank matrices, where G ∈ R^(n×r) and H ∈ R^(r×m).
Further, the step 2 includes:
based on the known feature weight vector w, the filling loss function of incomplete data is:
where G and H are low-rank matrices of rank r (the product GH is the feature matrix with the missing components filled in), G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter that needs to be tuned.
Further, the step 3 includes:
the optimization problem of step 3 can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
in particular, it can be further decomposed into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
in particular, it can be further decomposed into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
Further, the step 4 includes:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: based on an approximate leave-one-out scheme for the neighbor model, the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature. It should be noted that representing the weight of the l-th feature by w_l^2 ensures that the feature weights are non-negative. In particular, when w_l^2 is used in place of |w_l| to represent the feature weights, the introduced regularization term is essentially an L_1 regularization term with respect to the weights. Therefore, a sparse solution of the feature weights can be obtained by optimizing this loss function.
Further, the step 5 includes:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
Further, in the step 6:
the calculation formula of the length of the feature weight vector w is:
to verify the effect of the invention, the following experiments were performed:
common padding schemes for missing data sets (data sets containing missing values) are mean interpolation (AI), KNNI and zero padding (ZI). The above three filling schemes are combined with the feature selection method ReliefF to form three feature selection methods for processing missing data sets. In order to verify the effectiveness of the method, a comparison experiment is performed on 6 missing data sets between the feature selection method and the feature selection method for processing the incomplete data sets. Specifically, the 6 missing data sets are respectively: dermotology, wisconsin, mammagraphic, bands, housevelots, hepatics; in particular, the 6 missing data sets may be selected from the web sitehttps://sci2s.ugr.es/keel/ datasets.phpAnd (5) uploading and downloading.
The dermatology dataset consists of 366 samples and 34 features, with 6 sample classes and 8 samples containing missing values. The dataset was constructed for identifying related skin diseases, and its features take values as follows: if any of the studied skin diseases has been observed in the patient's family, the "family history" feature takes value 1, and otherwise 0. The "age" feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the largest possible degree, and 1 and 2 indicate relative intermediate values.
The wisconsin dataset comprises 699 samples and 9 features, with 2 sample classes and 16 samples containing missing values. The dataset contains cases from a study of breast cancer surgery patients; the task is to determine whether a detected tumor is benign or malignant.
The mammographic dataset comprises 961 samples and 5 features, with 2 sample classes and 131 samples containing missing values. The dataset uses attributes such as the patient's age and the BI-RADS assessment to predict the severity (benign or malignant) of a breast mass lesion.
The bands dataset comprises 539 samples and 19 features, with 2 sample classes, of which 174 samples contain missing values. It poses a classification problem from rotogravure (intaglio) printing, the goal being to determine whether banding occurs on a given cylinder.
The housevotes dataset comprises 435 samples and 16 features, with 2 sample classes and 203 samples containing missing values; it is a dataset concerning congressional voting records.
The hepatitis dataset comprises 155 samples and 19 features, with 2 sample classes and 75 samples containing missing values. It concerns hepatitis, and the aim is to determine whether a patient with the disease survives or dies.
Specific information for the 6 missing data sets described above is shown in table 1.
Table 1: details of the 6 missing datasets
In the experiments of the invention, 70% of each dataset is randomly selected as the training set and the remaining 30% as the test set. The classification accuracy of each method is computed with a KNN classifier. The above procedure is repeated 10 times and the average accuracy of each feature selection method is calculated. The resulting average accuracy and standard deviation of the four feature selection methods are shown in Table 2, where bold type indicates the best value in each row.
Table 2: average accuracy and standard deviation of the four feature selection methods
As can be seen from table 2, the feature selection method of the present invention achieves the highest average classification accuracy among the above 6 missing data sets.
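The evaluation protocol described above (random 70/30 split, KNN accuracy on the test part, averaged over 10 repetitions) can be sketched as follows; the plain NumPy majority-vote KNN and the function name `knn_accuracy` are stand-ins for whatever implementation was actually used in the experiments.

```python
import numpy as np

def knn_accuracy(Z, y, test_frac=0.3, k=3, repeats=10, seed=0):
    """Repeat a random split `repeats` times and return the mean accuracy
    of a k-nearest-neighbor classifier on the held-out fraction."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_test = int(n * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        correct = 0
        for p in test:
            d = np.linalg.norm(Z[train] - Z[p], axis=1)
            nn = train[np.argsort(d)[:k]]
            # majority vote among the k nearest training samples
            vals, counts = np.unique(y[nn], return_counts=True)
            correct += vals[np.argmax(counts)] == y[p]
        accs.append(correct / n_test)
    return float(np.mean(accs))
```

Running this on the filled-and-reduced feature matrix of each dataset reproduces the shape of the protocol, with accuracy averaged over the 10 random splits.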
On the basis of the above embodiment, another aspect of the present invention further provides a neighbor model feature selection device for incomplete data, including:
an initialization module for initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
a first construction module for constructing a filling loss function of incomplete data based on a given feature weight vector w;
a filling module for minimizing a loss function in the first building module using an alternating iterative optimization algorithm to fill missing components in incomplete data;
the second construction module is used for constructing a loss function of the neighbor model feature selection method based on the filled data;
the optimizing module is used for optimizing the loss function in the second building module by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
the loop module is used for loop executing the filling module, the second construction module and the optimizing module until the length change of the characteristic weight vector is smaller than a threshold value or the maximum iteration number is reached;
and the sorting output module is used for sorting the features in a descending order according to the finally output feature weighting vector so as to select the optimal feature subset.
Further, the first construction module is specifically configured to:
based on the known feature weight vector w, the padding loss function of incomplete data is:
wherein G, H is a low-rank matrix with rank R (the product of G and H is the characteristic matrix filled with missing components), G∈R n×r ,H∈R r×m R < min (n, m), n being the total number of samples in the dataset containing missing values; w (w) q 、G p and Hq Respectively representing the q-th element of w, the p-th row of G and the q-th column of H; Ω= { (p, q) |x pq Is observable } is the index set of all observable elements in X, x= [ X ] 1 ,x 2 ,...,x n ] T ∈R n×m Is a feature matrix containing missing components, x pq Representing elements on the p-th row, q-th column of matrix X; beta>0 is a hyper-parameter that needs to be adjusted.
Further, the filling module is specifically configured to:
the optimization problem of the loss function in the first construction module can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
it further decomposes into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
it further decomposes into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
Further, the second construction module is specifically configured to:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature.
Further, the optimization module is specifically configured to:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
In summary, the feature weight vector obtained by the neighbor model feature selection method is used in the calculation of the filling loss; that is, the importance of the features is considered when the filling loss is calculated, so the classification accuracy of feature selection for incomplete data can be effectively improved.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.
Claims (4)
1. A method of neighbor model feature selection for incomplete data in a dermatology dataset, comprising:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dermatology dataset; the dermatology dataset, constructed for identifying related skin diseases, consists of 366 samples and 34 features, of which 8 samples contain missing values; if any of the studied skin diseases has been observed in the subject's family, the family-history feature takes value 1, and otherwise 0; the age feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the maximum value, and 1 and 2 indicate relative intermediate values;
step 2: constructing a filling loss function of the incomplete data in the dermatology dataset based on a given feature weight vector w;
step 3: minimizing the loss function in step 2 by an alternating iterative optimization algorithm so as to fill the missing components of the incomplete data in the dermatology dataset;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: the features are ordered in descending order according to the finally output feature weighting vector, so that an optimal feature subset is selected;
the step 2 comprises the following steps:
based on the known feature weight vector w, the filling loss function of the incomplete data in the dermatology dataset is:
where G and H are low-rank matrices of rank r, G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dermatology dataset; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components corresponding to the dermatology dataset and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter to be tuned;
The step 4 comprises the following steps:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:

d_pq(w) = ( Σ_{l=1}^m w_l^2 (z_pl − z_ql)^2 )^{1/2}

step 4.2: the probability that test sample z_p selects sample z_q as its reference point during prediction is:

p_pq(w) = k(d_pq(w)) / Σ_{j ∈ S_p ∪ D_p} k(d_pj(w))

where k(d) = exp(−d/σ), σ is the kernel width, and S_p and D_p are, respectively, the index set of the first K_1 nearest-neighbor samples with the same label as sample z_p and the index set of the first K_2 nearest-neighbor samples with labels different from that of sample z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is calculated as:

p_p(w) = Σ_{q ∈ D_p} p_pq(w)

step 4.4: the classification error is defined as:

ε(w) = Σ_{p=1}^n p_p(w)

step 4.5: introducing a regularization term yields the loss function of the neighbor model feature selection method:

L(w) = ε(w) + λ Σ_{l=1}^m w_l^2

where λ is a positive balance parameter and w_l denotes the weight of the l-th feature.
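Steps 4.1 to 4.5 can be sketched end to end. The equation images are not reproduced on this page, so this code assumes the standard neighborhood-component-style form consistent with the definitions given above (k(d) = exp(−d/σ), S_p and D_p as the K_1 same-label and K_2 different-label nearest neighbors); the function and parameter names are illustrative only.

```python
import numpy as np

def neighbor_loss(Z, y, w, sigma=1.0, K1=3, K2=3, lam=0.1):
    """Neighbor-model feature-selection loss (assumed form): summed
    misprediction probability plus lam * sum_l w_l^2."""
    n, m = Z.shape
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt(np.sum((w ** 2) * diff ** 2, axis=2))    # d_pq(w), step 4.1
    total_err = 0.0
    for p in range(n):
        order = [q for q in np.argsort(D[p]) if q != p]
        S_p = [q for q in order if y[q] == y[p]][:K1]     # same-label neighbors
        D_p = [q for q in order if y[q] != y[p]][:K2]     # different-label neighbors
        ref = S_p + D_p
        kvals = np.exp(-D[p, ref] / sigma)                # k(d) = exp(-d/sigma)
        probs = kvals / kvals.sum()                       # reference probabilities, 4.2
        total_err += probs[len(S_p):].sum()               # p_p(w): mass on D_p, 4.3
    return total_err + lam * np.sum(w ** 2)               # steps 4.4-4.5
```

Informative weights pull same-label neighbors closer, shifting reference probability away from D_p and shrinking the loss, which is what makes the loss usable for ranking features.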
2. The neighbor model feature selection method for incomplete data in a dermatology dataset according to claim 1, wherein said step 3 comprises:
the optimization problem of the filling loss function for incomplete data in the dermatology dataset can be converted into the following optimization problem:

min_{G,H} ||P_Ω((X − GH)W)||_F^2 + β(||G||_F^2 + ||H||_F^2)   (1)

where W = diag(w), Ω̄ denotes the index set of all missing elements of X, and the function P_Ω(X) is defined elementwise by [P_Ω(X)]_pq = x_pq if (p, q) ∈ Ω, and 0 otherwise;
to solve the optimization problem in equation (1), first initialize G to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, find H^(k) based on G = G^(k−1) and compute G^(k) based on H = H^(k), until a stopping criterion is reached;
when G is fixed as G^(k−1), the optimization problem in equation (1) reduces to an optimization over H alone, which separates by the columns H_q of H;
it is thereby converted into m independent optimization sub-problems, one per column;
each sub-problem is a regularized least-squares problem over the observed entries of the corresponding column of X and therefore has an analytical solution, so H^(k) can be solved quickly;
when H is fixed as H^(k), the optimization problem in equation (1) reduces to an optimization over G alone, which separates by the rows G_p of G;
it is thereby converted into n independent optimization sub-problems, one per row;
each sub-problem likewise has an analytical solution, whose computation involves inverting a regularized r × r matrix, where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
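The alternating scheme of claim 2 can be sketched as follows. The closed-form per-column and per-row updates below are an assumed ridge-regression reconstruction of the analytical solutions whose images are not reproduced on this page; the function name and defaults are illustrative.

```python
import numpy as np

def als_fill(X, mask, w, r=2, beta=1e-3, n_iter=50, seed=0):
    """Alternating optimization for the filling problem: each column H_q
    (row G_p) solves a small regularized least-squares problem over the
    observed entries only (assumed closed forms)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    G = np.linalg.qr(rng.standard_normal((n, r)))[0]   # G(0): orthogonal columns
    H = np.zeros((r, m))
    Xz = np.where(mask, X, 0.0)                        # zero out missing entries
    for _ in range(n_iter):
        for q in range(m):                   # fix G, solve for each column H_q
            rows = mask[:, q]
            Gq = G[rows]
            A = (w[q] ** 2) * Gq.T @ Gq + beta * np.eye(r)
            b = (w[q] ** 2) * Gq.T @ Xz[rows, q]
            H[:, q] = np.linalg.solve(A, b)
        for p in range(n):                   # fix H, solve for each row G_p
            cols = mask[p]
            Hw = H[:, cols] * (w[cols] ** 2)  # column weights w_q^2
            A = Hw @ H[:, cols].T + beta * np.eye(r)
            b = Hw @ Xz[p, cols]
            G[p] = np.linalg.solve(A, b)
    return G @ H                              # filled matrix Z = G H
```

Each inner solve is an r × r system (the β I_r term mirrors the identity matrix mentioned in the claim), so one sweep costs far less than factoring the full matrix.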
3. A method for selecting features of a neighbor model for incomplete data in a dermatology dataset according to claim 1, wherein said step 5 comprises:
step 5.1: compute the partial derivative of the loss function of step 4 with respect to the vector w;
step 5.2: update the feature weight vector w using the gradient descent method;
step 5.3: repeat steps 5.1 and 5.2 until the absolute difference between the loss function values of two adjacent iterations is less than a threshold or the maximum number of iterations is reached;
step 5.4: output the feature weight vector w.
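Steps 5.1 to 5.4 can be sketched as a generic loop. The analytical partial derivative of step 5.1 is not reproduced on this page, so a central finite difference stands in for it here; `loss_fn` is the neighbor-model loss viewed as a function of w, and the step size and tolerances are illustrative.

```python
import numpy as np

def gradient_descent_w(loss_fn, w0, lr=0.1, tol=1e-8, max_iter=500, eps=1e-5):
    """Steps 5.1-5.4 as a sketch: finite-difference gradient descent on w."""
    w = np.asarray(w0, dtype=float).copy()
    prev = loss_fn(w)
    for _ in range(max_iter):
        grad = np.array([(loss_fn(w + eps * e) - loss_fn(w - eps * e)) / (2 * eps)
                         for e in np.eye(len(w))])      # step 5.1 (numerical)
        w = w - lr * grad                               # step 5.2: gradient step
        cur = loss_fn(w)
        if abs(cur - prev) < tol:                       # step 5.3: stopping rule
            break
        prev = cur
    return w                                            # step 5.4: output w
```

In practice the analytical gradient is preferred; the finite-difference version costs 2m loss evaluations per iteration but needs no derivation.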
4. A neighbor model feature selection apparatus for incomplete data in a dermatology dataset, comprising:
an initialization module, configured to initialize the feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m to an all-ones vector, where m represents the number of features in the dermatology dataset; the dermatology dataset, constructed for identifying the skin diseases of interest, consists of 366 samples and 34 features, 8 of which contain missing values; if one of the skin diseases under study has occurred in the family of the subject, the family-history feature takes the value 1, and otherwise 0; the age feature records the age of the patient; the remaining features take values in the range 0-3, where 0 indicates that the feature is absent, 3 indicates the maximum degree, and 1 and 2 indicate relative intermediate values;
a first construction module, configured to construct a filling loss function for the incomplete data in the dermatology dataset based on the given feature weight vector w;
a filling module, configured to minimize the loss function constructed by the first construction module using an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data in the dermatology dataset;
a second construction module, configured to construct the loss function of the neighbor model feature selection method based on the filled data;
an optimization module, configured to optimize the loss function constructed by the second construction module using a gradient descent method, so as to obtain a solution for the feature weight vector;
a loop module, configured to execute the filling module, the second construction module and the optimization module in a loop until the change in the norm of the feature weight vector falls below a threshold or the maximum number of iterations is reached;
a sorting output module, configured to sort the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset;
the first construction module is specifically configured to:
based on the known feature weight vector w, the filling loss function for the incomplete data is:

min_{G,H} Σ_{(p,q)∈Ω} w_q^2 (x_pq − G_p H_q)^2 + β(||G||_F^2 + ||H||_F^2)

where G, H are low-rank matrices with rank r (the product of G and H is the feature matrix with its missing components filled), G ∈ R^{n×r}, H ∈ R^{r×m}, r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^{n×m} is the feature matrix containing missing components, and x_pq denotes the element in the p-th row and q-th column of X; β > 0 is a hyperparameter to be tuned;
the second construction module is specifically configured to:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:

d_pq(w) = ( Σ_{l=1}^m w_l^2 (z_pl − z_ql)^2 )^{1/2}

step 4.2: the probability that test sample z_p selects sample z_q as its reference point during prediction is:

p_pq(w) = k(d_pq(w)) / Σ_{j ∈ S_p ∪ D_p} k(d_pj(w))

where k(d) = exp(−d/σ), σ is the kernel width, and S_p and D_p are, respectively, the index set of the first K_1 nearest-neighbor samples with the same label as sample z_p and the index set of the first K_2 nearest-neighbor samples with labels different from that of sample z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is calculated as:

p_p(w) = Σ_{q ∈ D_p} p_pq(w)

step 4.4: the classification error is defined as:

ε(w) = Σ_{p=1}^n p_p(w)

step 4.5: introducing a regularization term yields the loss function of the neighbor model feature selection method:

L(w) = ε(w) + λ Σ_{l=1}^m w_l^2

where λ is a positive balance parameter and w_l denotes the weight of the l-th feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110559948.0A CN113177608B (en) | 2021-05-21 | 2021-05-21 | Neighbor model feature selection method and device for incomplete data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177608A CN113177608A (en) | 2021-07-27 |
CN113177608B true CN113177608B (en) | 2023-09-05 |
Family
ID=76929611
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110559948.0A Active CN113177608B (en) | 2021-05-21 | 2021-05-21 | Neighbor model feature selection method and device for incomplete data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177608B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117435906B (en) * | 2023-12-18 | 2024-03-12 | 湖南行必达网联科技有限公司 | New energy automobile configuration feature selection method based on cross entropy |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104091038A (en) * | 2013-04-01 | 2014-10-08 | 太原理工大学 | Method for weighting multiple example studying features based on master space classifying criterion |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107220346A (en) * | 2017-05-27 | 2017-09-29 | 荣科科技股份有限公司 | A kind of higher-dimension deficiency of data feature selection approach |
CN107818328A (en) * | 2016-09-14 | 2018-03-20 | 南京航空航天大学 | With reference to the deficiency of data similitude depicting method of local message |
CN108446735A (en) * | 2018-03-06 | 2018-08-24 | 宁波大学 | A kind of feature selection approach optimizing neighbour's constituent analysis based on differential evolution |
CN108765517A (en) * | 2018-04-18 | 2018-11-06 | 天津大学 | A kind of multiple amount vision data fill methods based on convex optimization |
CN110188812A (en) * | 2019-05-24 | 2019-08-30 | 长沙理工大学 | A kind of multicore clustering method of quick processing missing isomeric data |
CN110705762A (en) * | 2019-09-20 | 2020-01-17 | 天津大学 | Ubiquitous power Internet of things perception data missing repairing method based on matrix filling |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9262721B2 (en) * | 2012-11-14 | 2016-02-16 | Repsol, S.A. | Automatically selecting analogous members for new population members based on incomplete descriptions, including an uncertainty characterzing selection |
US10430928B2 (en) * | 2014-10-23 | 2019-10-01 | Cal Poly Corporation | Iterated geometric harmonics for data imputation and reconstruction of missing data |
US20200193220A1 (en) * | 2018-12-18 | 2020-06-18 | National Sun Yat-Sen University | Method for data imputation and classification and system for data imputation and classification |
Non-Patent Citations (1)
Title |
---|
Missing data imputation by K nearest neighbours based on grey relational structure and mutual information; Ruilin Pan, et al.; Applied Intelligence; Vol. 43; pp. 614-632 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |