CN113177608B - Neighbor model feature selection method and device for incomplete data - Google Patents


Info

Publication number: CN113177608B (application CN202110559948.0A)
Authority: CN (China)
Legal status: Active (granted)
Inventors: 杨伟, 王月, 葛文庚, 文云光
Assignee (current and original): Henan University
Application filed by Henan University; priority to CN202110559948.0A
Other publications: CN113177608A (application publication, in Chinese)
Prior art keywords: feature, loss function, incomplete data, samples, vector


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate


Abstract

The invention discloses a neighbor model feature selection method and device for incomplete data. The method comprises: step 1, initializing the feature weight vector $w=[w_1,w_2,\ldots,w_m]\in\mathbb{R}^m$ as an all-ones vector; step 2, constructing the filling loss function of the incomplete data based on the given feature weight vector $w$; step 3, minimizing the loss function of step 2 with an alternating iterative optimization algorithm; step 4, constructing the loss function of the neighbor-model feature selection method based on the filled data; step 5, optimizing the loss function of step 4 by gradient descent; step 6, executing steps 3 to 5 cyclically until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and step 7, sorting the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset. Because the method and device take the importance of the features into account when computing the filling loss, they can effectively improve the classification accuracy of feature selection for incomplete data.

Description

Neighbor model feature selection method and device for incomplete data
Technical Field
The invention relates to the technical field of feature selection, and in particular to a neighbor model feature selection method and device for incomplete data.
Background
Feature selection is an effective data preprocessing technique that is widely used in fields such as image analysis, text analysis, and bioinformatics. In practice, many data sets contain missing values due to machine faults, limited hardware conditions, and similar causes; that is, high-dimensional incomplete data sets arise in many fields. A high-dimensional incomplete data set not only increases the time and space costs of a model but also reduces its classification accuracy. Although many feature selection algorithms exist, most are designed for complete data [Tran, C.T., et al., Improving performance of classification on incomplete data using feature selection and clustering. Applied Soft Computing, 2018. 73: p. 848-861].
Currently, feature selection and data filling are generally treated as two independent processes [Zhao, Y. and Q. Long, Variable selection in the presence of missing data: imputation-based methods. Wiley Interdisciplinary Reviews: Computational Statistics, 2017. 9(5): p. e1402]: the incomplete data are first filled based on mean imputation or another criterion, and feature selection is then performed on the filled data. Because the importance of the features is not considered during data filling, this type of approach typically ignores some important features.
Disclosure of Invention
To address the problem that existing feature selection methods for incomplete data generally treat feature selection and data filling as two independent processes, so that the importance of the features is not considered during filling and some important features are ignored, the invention provides a neighbor model feature selection method and device for incomplete data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides a neighbor model feature selection method aiming at incomplete data, which comprises the following steps:
step 1: initializing a feature weight vector w= [ w ] 1 ,w 2 ,...,w m ]∈R m A vector of all 1's, where m represents the number of features in the dataset that contain missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: and sorting the features in a descending order according to the finally output feature weighting vector, so as to select an optimal feature subset.
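For concreteness, the overall flow of steps 1 to 7 can be sketched in Python as follows. This is only an illustrative sketch, not the patented reference implementation: the helpers `fill_incomplete_data` and `optimize_feature_weights` are defined in the sketches given after steps 3 and 5 below, and the parameter names (`r`, `beta`, `max_outer`, `tol`) are assumptions.

```python
import numpy as np

def select_features(X, y, r=5, beta=0.1, max_outer=20, tol=1e-4):
    """Sketch of steps 1-7: alternate data filling and weight learning."""
    n, m = X.shape
    w = np.ones(m)                      # step 1: all-ones feature weight vector
    for _ in range(max_outer):          # step 6: outer loop
        Z = fill_incomplete_data(X, w, r, beta)      # steps 2-3: weighted low-rank filling
        w_new = optimize_feature_weights(Z, y, w)    # steps 4-5: neighbor-model weights
        if abs(np.linalg.norm(w_new) - np.linalg.norm(w)) < tol:
            w = w_new
            break                       # length of w has stabilized
        w = w_new
    order = np.argsort(-w**2)           # step 7: sort by effective weight, descending
    return w, order
```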
Further, the step 2 includes:
Based on the known feature weight vector $w$, the filling loss function of the incomplete data is:

$$\min_{G,H}\ \sum_{(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q\big)^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)$$

where $G$ and $H$ are low-rank matrices of rank $r$ (the product of $G$ and $H$ is the feature matrix with the missing components filled in), $G\in\mathbb{R}^{n\times r}$, $H\in\mathbb{R}^{r\times m}$, $r<\min(n,m)$, and $n$ is the total number of samples in the data set containing missing values; $w_q$, $G_p$ and $H_q$ denote the $q$-th element of $w$, the $p$-th row of $G$, and the $q$-th column of $H$, respectively; $\Omega=\{(p,q)\mid x_{pq}\ \text{is observable}\}$ is the index set of all observable elements of $X$, where $X=[x_1,x_2,\ldots,x_n]^{T}\in\mathbb{R}^{n\times m}$ is the feature matrix containing missing components and $x_{pq}$ is the element in the $p$-th row and $q$-th column of $X$; $\beta>0$ is a hyper-parameter that needs to be tuned.
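As one concrete reading of this loss, the sketch below evaluates it for given $G$ and $H$, representing $\Omega$ by a boolean mask `observed`; the function name and signature are illustrative assumptions, not part of the patent.

```python
import numpy as np

def filling_loss(X, observed, w, G, H, beta):
    """Weighted low-rank filling loss: sum over observed (p, q) of
    w[q] * (X[p, q] - (G @ H)[p, q])**2 plus Frobenius regularization."""
    R = np.where(observed, X - G @ H, 0.0)   # residuals on observed entries only
    weighted_sq = (R**2) * w[None, :]        # column q weighted by w[q]
    return weighted_sq.sum() + beta * ((G**2).sum() + (H**2).sum())
```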
Further, the step 3 includes:
The optimization problem for the filling loss function of the incomplete data can be converted into the following form:

$$\min_{G,H}\ \big\|P_\Omega\big((X-GH)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)\qquad(1)$$

where $\bar{\Omega}$ is the index set of all missing elements of $X$, and the operator $P_\Omega(\cdot)$ is defined element-wise as

$$[P_\Omega(X)]_{pq}=\begin{cases}x_{pq}, & (p,q)\in\Omega\\ 0, & (p,q)\in\bar{\Omega}\end{cases}$$

To solve the optimization problem in equation (1), $G$ is first initialized to a random matrix $G^{(0)}$ with orthogonal columns; then, at the $k$-th iteration, $H^{(k)}$ is computed based on $G=G^{(k-1)}$, and $G^{(k)}$ is computed based on $H=H^{(k)}$, until a stopping criterion is reached.

When $G$ is fixed at $G^{(k-1)}$, the optimization problem in equation (1) is converted into the following form:

$$\min_{H}\ \big\|P_\Omega\big((X-G^{(k-1)}H)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|H\|_F^{2}$$

which further decomposes into $m$ independent optimization sub-problems, one for each column $H_q$:

$$\min_{H_q}\ \sum_{p:(p,q)\in\Omega} w_q\big(x_{pq}-G_p^{(k-1)}H_q\big)^{2}+\beta\|H_q\|_2^{2},\qquad q=1,\ldots,m$$

Each sub-problem is a ridge regression with the analytical solution

$$H_q^{(k)}=\big(w_q\,G_{\Omega_q}^{T}G_{\Omega_q}+\beta I_r\big)^{-1}w_q\,G_{\Omega_q}^{T}x_{\Omega_q}$$

where $G_{\Omega_q}$ stacks the rows $G_p^{(k-1)}$ with $(p,q)\in\Omega$ and $x_{\Omega_q}$ collects the corresponding observed entries of the $q$-th column of $X$, so $H^{(k)}$ can be solved quickly.

When $H$ is fixed at $H^{(k)}$, the optimization problem in equation (1) can be converted into the following form:

$$\min_{G}\ \big\|P_\Omega\big((X-GH^{(k)})\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|G\|_F^{2}$$

which further decomposes into $n$ independent optimization sub-problems, one for each row $G_p$:

$$\min_{G_p}\ \sum_{q:(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q^{(k)}\big)^{2}+\beta\|G_p\|_2^{2},\qquad p=1,\ldots,n$$

Each sub-problem has the analytical solution

$$G_p^{(k)}=x_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}\big(H_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}+\beta I_r\big)^{-1}$$

where $H_{\Omega_p}$ collects the columns $H_q^{(k)}$ with $(p,q)\in\Omega$, $W_{\Omega_p}$ is the diagonal matrix of the corresponding weights $w_q$, $x_{\Omega_p}$ is the row vector of observed entries of the $p$-th row of $X$, and $I_r$ is the $r\times r$ identity matrix.

When the stopping criterion is reached, the filled matrix $Z=G^{(k)}H^{(k)}$ is output.
Further, the step 4 includes:
Step 4.1: for any two filled samples $z_p$ and $z_q$, the weighted Euclidean distance $d_{pq}(w)$ between them is defined as

$$d_{pq}(w)=\sqrt{\sum_{l=1}^{m}w_l^{2}\big(z_{pl}-z_{ql}\big)^{2}}$$

Step 4.2: the probability that test sample $z_p$ selects sample $z_q$ as its reference point during prediction is

$$P\big(z_q\mid z_p\big)=\frac{k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)},\qquad q\in S_p\cup D_p$$

where $k(d)=\exp(-d/\sigma)$, $\sigma$ is the kernel width, and $S_p$ and $D_p$ are, respectively, the index set of the $K_1$ nearest-neighbor samples with the same label as $z_p$ and the index set of the $K_2$ nearest-neighbor samples with labels different from that of $z_p$;

Step 4.3: the probability $p_p(w)$ that sample $z_p$ is mispredicted is computed as

$$p_p(w)=\sum_{q\in D_p}P\big(z_q\mid z_p\big)=\frac{\sum_{q\in D_p}k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)}$$

Step 4.4: the classification error is defined as

$$\varepsilon(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)$$

Step 4.5: introducing a regularization term yields the loss function of the neighbor-model feature selection method:

$$L(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)+\lambda\sum_{l=1}^{m}w_l^{2}$$

where $\lambda$ is a positive balance parameter and $w_l^{2}$ represents the weight of the $l$-th feature.
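The sketch below evaluates this loss for a candidate $w$ under the formulas reconstructed above; the neighbor sets $S_p$ and $D_p$ are recomputed from the weighted distances, and `K1`, `K2`, `sigma` and `lam` are assumed tunable parameters rather than values fixed by the patent.

```python
import numpy as np

def neighbor_loss(Z, y, w, K1=5, K2=5, sigma=1.0, lam=0.01):
    """Loss of the neighbor-model feature selection method (step 4 sketch)."""
    n, m = Z.shape
    diff = (Z[:, None, :] - Z[None, :, :]) * w[None, None, :]
    D = np.sqrt((diff**2).sum(axis=-1))          # weighted distances d_pq(w)
    err = 0.0
    for p in range(n):
        same = np.where((y == y[p]) & (np.arange(n) != p))[0]
        other = np.where(y != y[p])[0]
        S = same[np.argsort(D[p, same])[:K1]]    # K1 nearest same-label neighbors
        Dp = other[np.argsort(D[p, other])[:K2]] # K2 nearest different-label neighbors
        k = np.exp(-D[p, np.concatenate([S, Dp])] / sigma)  # kernel k(d) = exp(-d/sigma)
        err += k[len(S):].sum() / k.sum()        # p_p(w): mass on different-label refs
    return err / n + lam * (w**2).sum()          # mean error + regularization term
```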
Further, the step 5 includes:
Step 5.1: compute the partial derivatives of the loss function of step 4 with respect to the vector $w$;

Step 5.2: update the feature weight vector $w$ by gradient descent;

Step 5.3: repeat steps 5.1 and 5.2 until the absolute difference between the loss values of two adjacent iterations is smaller than a threshold or the maximum number of iterations is reached;

Step 5.4: output the feature weight vector $w$.
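A minimal sketch of the update loop of steps 5.1 to 5.4. For brevity it estimates the gradient by central finite differences instead of the closed-form partial derivatives of step 5.1; that substitution, and all step sizes and thresholds, are assumptions for illustration.

```python
import numpy as np

def optimize_feature_weights(Z, y, w0, lr=0.1, max_iter=100, tol=1e-5, eps=1e-6):
    """Gradient descent on the neighbor-model loss (step 5 sketch)."""
    w = w0.copy()
    prev = neighbor_loss(Z, y, w)
    for _ in range(max_iter):
        grad = np.zeros_like(w)
        for l in range(len(w)):                  # finite-difference partial derivatives
            e = np.zeros_like(w)
            e[l] = eps
            grad[l] = (neighbor_loss(Z, y, w + e) - neighbor_loss(Z, y, w - e)) / (2 * eps)
        w = w - lr * grad                        # step 5.2: gradient descent update
        cur = neighbor_loss(Z, y, w)
        if abs(prev - cur) < tol:                # step 5.3: stop on small loss change
            break
        prev = cur
    return w                                     # step 5.4: output the weight vector
```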
The invention also provides a neighbor model feature selection device for incomplete data, comprising:

an initialization module for initializing the feature weight vector $w=[w_1,w_2,\ldots,w_m]\in\mathbb{R}^m$ as an all-ones vector, where $m$ denotes the number of features in the data set containing missing values;

a first construction module for constructing the filling loss function of the incomplete data based on the given feature weight vector $w$;

a filling module for minimizing the loss function of the first construction module with an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data;

a second construction module for constructing the loss function of the neighbor-model feature selection method based on the filled data;

an optimization module for optimizing the loss function of the second construction module by gradient descent, so as to obtain a solution for the feature weight vector;

a loop module for cyclically executing the filling module, the second construction module and the optimization module until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and

a sorting output module for sorting the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset.
Compared with the prior art, the invention has the following beneficial effects:

In the neighbor model feature selection method and device for incomplete data, the feature weight vector obtained by the neighbor-model feature selection method is used in the computation of the filling loss; that is, the importance of the features is taken into account when the filling loss is computed, which can effectively improve the classification accuracy of feature selection for incomplete data.
Drawings
Fig. 1 is a basic flowchart of the neighbor model feature selection method for incomplete data according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
as shown in fig. 1, a method for selecting a feature of a neighbor model for incomplete data includes:
step 1: initializing a feature weight vector w= [ w ] 1 ,w 2 ,...,w m ]∈R m A vector of all 1's, where m represents the number of features in the dataset that contain missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: and sorting the features in a descending order according to the finally output feature weighting vector, so as to select an optimal feature subset.
As an embodiment, the data set containing missing values can be expressed as $T=(X,y)$, where $y=[y_1,y_2,\ldots,y_n]^{T}\in\mathbb{R}^{n}$ is the label vector, $X=[x_1,x_2,\ldots,x_n]^{T}\in\mathbb{R}^{n\times m}$ is the feature matrix containing missing components, and $n$ is the total number of samples in the data set. For the feature matrix $X$, let $\Omega=\{(p,q)\mid x_{pq}\ \text{is observable}\}$ be the index set of all observable elements of $X$ and $\bar{\Omega}$ the index set of all missing elements of $X$, where $x_{pq}$ is the element in the $p$-th row and $q$-th column of $X$. In addition, let $Z\in\mathbb{R}^{n\times m}$ be the numerical matrix of rank $r$ ($r<\min(n,m)$) obtained after filling $X$; it can be further decomposed into the product $Z=GH$ of two low-rank matrices, where $G\in\mathbb{R}^{n\times r}$ and $H\in\mathbb{R}^{r\times m}$.
Further, the step 2 includes:
Based on the known feature weight vector $w$, the filling loss function of the incomplete data is:

$$\min_{G,H}\ \sum_{(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q\big)^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)$$

where $G$ and $H$ are low-rank matrices of rank $r$ (the product of $G$ and $H$ is the feature matrix with the missing components filled in), $G\in\mathbb{R}^{n\times r}$, $H\in\mathbb{R}^{r\times m}$, $r<\min(n,m)$, and $n$ is the total number of samples in the data set containing missing values; $w_q$, $G_p$ and $H_q$ denote the $q$-th element of $w$, the $p$-th row of $G$, and the $q$-th column of $H$, respectively; $\Omega$, $X$ and $x_{pq}$ are as defined above; $\beta>0$ is a hyper-parameter that needs to be tuned.
Further, the step 3 includes:
the optimization problem of step 3 can be converted into the following optimization problem:
wherein the matrix Index set of elements of all missing values in X, function P Ω (X) is defined as follows:
to solve the optimization problem in equation (1), first initialize G to a random matrix G with orthogonal columns (0) Then based on g=g at the kth iteration (k-1) Find H (k) Based on h=h (k) G is calculated (k) Until a stopping criterion is reached;
when G is fixed as G (k-1) When this is the case, the optimization problem in the formula (1) is converted into the following form:
wherein ,representation->Is the q-th column of (2);
in particular, it can be further broken down into m independent optimization sub-problems:
for each sub-problem, it has an analytical solutionCan quickly solve H (k)
When H is fixed as H (k) The optimization problem in equation (1) can be converted into the following form:
wherein ,representation->P-th row of (a);
in particular, it can be further broken down into n independent optimization sub-problems:
for each sub-problem, its resolution is: wherein Ir Is an identity matrix with the size of r multiplied by r;
when the stopping criterion is reached, the filling matrix z=g is output (k) H (k)
Further, the step 4 includes:
Step 4.1: for any two filled samples $z_p$ and $z_q$, the weighted Euclidean distance $d_{pq}(w)$ between them is defined as

$$d_{pq}(w)=\sqrt{\sum_{l=1}^{m}w_l^{2}\big(z_{pl}-z_{ql}\big)^{2}}$$

Step 4.2: the probability that test sample $z_p$ selects sample $z_q$ as its reference point during prediction is

$$P\big(z_q\mid z_p\big)=\frac{k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)},\qquad q\in S_p\cup D_p$$

where $k(d)=\exp(-d/\sigma)$, $\sigma$ is the kernel width, and $S_p$ and $D_p$ are, respectively, the index set of the $K_1$ nearest-neighbor samples with the same label as $z_p$ and the index set of the $K_2$ nearest-neighbor samples with labels different from that of $z_p$;

Step 4.3: the probability $p_p(w)$ that sample $z_p$ is mispredicted is computed as

$$p_p(w)=\sum_{q\in D_p}P\big(z_q\mid z_p\big)=\frac{\sum_{q\in D_p}k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)}$$

Step 4.4: based on an approximate leave-one-out estimate under the neighbor model, the classification error is defined as

$$\varepsilon(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)$$

Step 4.5: introducing a regularization term yields the loss function of the neighbor-model feature selection method:

$$L(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)+\lambda\sum_{l=1}^{m}w_l^{2}$$

where $\lambda$ is a positive balance parameter and $w_l^{2}$ represents the weight of the $l$-th feature. It should be noted that representing the weight of the $l$-th feature by $w_l^{2}$ guarantees that the feature weights are non-negative. In particular, when $w_l^{2}$ is used in place of $|w_l|$ to represent the feature weights, the introduced regularization term is essentially an $L_1$ regularization term with respect to the weights. Therefore, a sparse solution for the feature weights can be obtained by optimizing this loss function.
Further, the step 5 includes:
Step 5.1: compute the partial derivatives of the loss function of step 4 with respect to the vector $w$;

Step 5.2: update the feature weight vector $w$ by gradient descent;

Step 5.3: repeat steps 5.1 and 5.2 until the absolute difference between the loss values of two adjacent iterations is smaller than a threshold or the maximum number of iterations is reached;

Step 5.4: output the feature weight vector $w$.
Further, in step 6, the length of the feature weight vector $w$ is computed as

$$\|w\|=\sqrt{\sum_{l=1}^{m}w_l^{2}}$$
to verify the effect of the invention, the following experiments were performed:
common padding schemes for missing data sets (data sets containing missing values) are mean interpolation (AI), KNNI and zero padding (ZI). The above three filling schemes are combined with the feature selection method ReliefF to form three feature selection methods for processing missing data sets. In order to verify the effectiveness of the method, a comparison experiment is performed on 6 missing data sets between the feature selection method and the feature selection method for processing the incomplete data sets. Specifically, the 6 missing data sets are respectively: dermotology, wisconsin, mammagraphic, bands, housevelots, hepatics; in particular, the 6 missing data sets may be selected from the web sitehttps://sci2s.ugr.es/keel/ datasets.phpAnd (5) uploading and downloading.
Dermatology consists of 366 samples and 34 features, with 6 sample classes; 8 samples contain missing values. The data set was constructed for identifying the skin diseases under study, and its features take the following values: if any of the studied skin diseases has been observed in the subject's family, the 'family history' feature takes the value 1, and otherwise 0; the 'age' feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the largest possible value, and 1 and 2 indicate relative intermediate values.

Wisconsin contains 699 samples and 9 features, with 2 sample classes; 16 samples contain missing values. The data set contains cases from a study of breast cancer surgery patients, and the task is to determine whether a detected tumor is benign or malignant.

Mammographic contains 961 samples and 5 features, with 2 sample classes; 131 samples contain missing values. The data set uses attributes such as the patient's age and the BI-RADS assessment to predict the severity (benign or malignant) of a mammographic mass lesion.

Bands contains 539 samples and 19 features, with 2 sample classes; 174 samples contain missing values. The data set is a classification problem from intaglio (rotogravure) printing, the goal being to determine whether cylinder banding occurs on a given workpiece.

HouseVotes contains 435 samples and 16 features, with 2 sample classes; 203 samples contain missing values. It is a data set concerning voting records.

Hepatitis contains 155 samples and 19 features, with 2 sample classes; 75 samples contain missing values. It is a data set concerning hepatitis, and the goal is to determine whether a hepatitis patient survives or dies.
Specific information on the above 6 missing data sets is given in Table 1.

Table 1: Details of the 6 missing data sets

Data set      Samples  Features  Classes  Samples with missing values
Dermatology   366      34        6        8
Wisconsin     699      9         2        16
Mammographic  961      5         2        131
Bands         539      19        2        174
HouseVotes    435      16        2        203
Hepatitis     155      19        2        75
In the experiments, 70% of each data set was randomly selected as the training set and the remaining 30% as the test set. The classification accuracy of each method was computed on a KNN classifier. The above procedure was repeated 10 times, and the average accuracy of each feature selection method was calculated. The average accuracy and standard deviation of the four feature selection methods are shown in Table 2, where bold entries indicate the best value in each row.

Table 2: Average accuracy and standard deviation of the four feature selection methods

As can be seen from Table 2, the feature selection method of the invention achieves the highest average classification accuracy on the above 6 missing data sets.
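The evaluation protocol described above (70/30 split, KNN accuracy, 10 repetitions) might be harnessed as in the sketch below, assuming scikit-learn and the `select_features` and `fill_incomplete_data` sketches given earlier; the number of selected features and the choice to fill the test set separately are placeholders, as the source does not state them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def evaluate(X, y, n_selected=10, repeats=10):
    """70/30 split, KNN accuracy on the selected features, averaged over repeats."""
    accs = []
    for seed in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        w, order = select_features(X_tr, y_tr)                # learn weights on training data
        top = order[:n_selected]                              # keep highest-weighted features
        Z_tr = fill_incomplete_data(X_tr, w, r=5, beta=0.1)   # fill before classifying
        Z_te = fill_incomplete_data(X_te, w, r=5, beta=0.1)
        clf = KNeighborsClassifier().fit(Z_tr[:, top], y_tr)
        accs.append(accuracy_score(y_te, clf.predict(Z_te[:, top])))
    return float(np.mean(accs)), float(np.std(accs))
```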
On the basis of the above embodiment, another aspect of the invention further provides a neighbor model feature selection device for incomplete data, comprising:

an initialization module for initializing the feature weight vector $w=[w_1,w_2,\ldots,w_m]\in\mathbb{R}^m$ as an all-ones vector, where $m$ denotes the number of features in the data set containing missing values;

a first construction module for constructing the filling loss function of the incomplete data based on the given feature weight vector $w$;

a filling module for minimizing the loss function of the first construction module with an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data;

a second construction module for constructing the loss function of the neighbor-model feature selection method based on the filled data;

an optimization module for optimizing the loss function of the second construction module by gradient descent, so as to obtain a solution for the feature weight vector;

a loop module for cyclically executing the filling module, the second construction module and the optimization module until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and

a sorting output module for sorting the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset.
Further, the first construction module is specifically configured to:
Based on the known feature weight vector $w$, the filling loss function of the incomplete data is:

$$\min_{G,H}\ \sum_{(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q\big)^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)$$

where $G$ and $H$ are low-rank matrices of rank $r$ (the product of $G$ and $H$ is the feature matrix with the missing components filled in), $G\in\mathbb{R}^{n\times r}$, $H\in\mathbb{R}^{r\times m}$, $r<\min(n,m)$, and $n$ is the total number of samples in the data set containing missing values; $w_q$, $G_p$ and $H_q$ denote the $q$-th element of $w$, the $p$-th row of $G$, and the $q$-th column of $H$, respectively; $\Omega=\{(p,q)\mid x_{pq}\ \text{is observable}\}$ is the index set of all observable elements of $X$, where $X=[x_1,x_2,\ldots,x_n]^{T}\in\mathbb{R}^{n\times m}$ is the feature matrix containing missing components and $x_{pq}$ is the element in the $p$-th row and $q$-th column of $X$; $\beta>0$ is a hyper-parameter that needs to be tuned.
Further, the filling module is specifically configured to:
The optimization problem of the loss function in the first construction module can be converted into the following form:

$$\min_{G,H}\ \big\|P_\Omega\big((X-GH)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)\qquad(1)$$

where $\bar{\Omega}$ is the index set of all missing elements of $X$, and the operator $P_\Omega(\cdot)$ is defined element-wise as

$$[P_\Omega(X)]_{pq}=\begin{cases}x_{pq}, & (p,q)\in\Omega\\ 0, & (p,q)\in\bar{\Omega}\end{cases}$$

To solve the optimization problem in equation (1), $G$ is first initialized to a random matrix $G^{(0)}$ with orthogonal columns; then, at the $k$-th iteration, $H^{(k)}$ is computed based on $G=G^{(k-1)}$, and $G^{(k)}$ is computed based on $H=H^{(k)}$, until a stopping criterion is reached.

When $G$ is fixed at $G^{(k-1)}$, the optimization problem in equation (1) is converted into the following form:

$$\min_{H}\ \big\|P_\Omega\big((X-G^{(k-1)}H)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|H\|_F^{2}$$

which further decomposes into $m$ independent optimization sub-problems, one for each column $H_q$:

$$\min_{H_q}\ \sum_{p:(p,q)\in\Omega} w_q\big(x_{pq}-G_p^{(k-1)}H_q\big)^{2}+\beta\|H_q\|_2^{2},\qquad q=1,\ldots,m$$

Each sub-problem has the analytical solution

$$H_q^{(k)}=\big(w_q\,G_{\Omega_q}^{T}G_{\Omega_q}+\beta I_r\big)^{-1}w_q\,G_{\Omega_q}^{T}x_{\Omega_q}$$

where $G_{\Omega_q}$ stacks the rows $G_p^{(k-1)}$ with $(p,q)\in\Omega$ and $x_{\Omega_q}$ collects the corresponding observed entries of the $q$-th column of $X$, so $H^{(k)}$ can be solved quickly.

When $H$ is fixed at $H^{(k)}$, the optimization problem in equation (1) can be converted into the following form:

$$\min_{G}\ \big\|P_\Omega\big((X-GH^{(k)})\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|G\|_F^{2}$$

which further decomposes into $n$ independent optimization sub-problems, one for each row $G_p$:

$$\min_{G_p}\ \sum_{q:(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q^{(k)}\big)^{2}+\beta\|G_p\|_2^{2},\qquad p=1,\ldots,n$$

Each sub-problem has the analytical solution

$$G_p^{(k)}=x_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}\big(H_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}+\beta I_r\big)^{-1}$$

where $H_{\Omega_p}$ collects the columns $H_q^{(k)}$ with $(p,q)\in\Omega$, $W_{\Omega_p}$ is the diagonal matrix of the corresponding weights $w_q$, $x_{\Omega_p}$ is the row vector of observed entries of the $p$-th row of $X$, and $I_r$ is the $r\times r$ identity matrix.

When the stopping criterion is reached, the filled matrix $Z=G^{(k)}H^{(k)}$ is output.
Further, the second construction module is specifically configured to:
Step 4.1: for any two filled samples $z_p$ and $z_q$, the weighted Euclidean distance $d_{pq}(w)$ between them is defined as

$$d_{pq}(w)=\sqrt{\sum_{l=1}^{m}w_l^{2}\big(z_{pl}-z_{ql}\big)^{2}}$$

Step 4.2: the probability that test sample $z_p$ selects sample $z_q$ as its reference point during prediction is

$$P\big(z_q\mid z_p\big)=\frac{k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)},\qquad q\in S_p\cup D_p$$

where $k(d)=\exp(-d/\sigma)$, $\sigma$ is the kernel width, and $S_p$ and $D_p$ are, respectively, the index set of the $K_1$ nearest-neighbor samples with the same label as $z_p$ and the index set of the $K_2$ nearest-neighbor samples with labels different from that of $z_p$;

Step 4.3: the probability $p_p(w)$ that sample $z_p$ is mispredicted is computed as

$$p_p(w)=\sum_{q\in D_p}P\big(z_q\mid z_p\big)=\frac{\sum_{q\in D_p}k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)}$$

Step 4.4: the classification error is defined as

$$\varepsilon(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)$$

Step 4.5: introducing a regularization term yields the loss function of the neighbor-model feature selection method:

$$L(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)+\lambda\sum_{l=1}^{m}w_l^{2}$$

where $\lambda$ is a positive balance parameter and $w_l^{2}$ represents the weight of the $l$-th feature.
Further, the optimization module is specifically configured to:
Step 5.1: compute the partial derivatives of the loss function of the second construction module with respect to the vector $w$;

Step 5.2: update the feature weight vector $w$ by gradient descent;

Step 5.3: repeat steps 5.1 and 5.2 until the absolute difference between the loss values of two adjacent iterations is smaller than a threshold or the maximum number of iterations is reached;

Step 5.4: output the feature weight vector $w$.
In summary, the feature weight vector obtained by the neighbor-model feature selection method is used in the computation of the filling loss; that is, the importance of the features is taken into account when computing the filling loss, which can effectively improve the classification accuracy of feature selection for incomplete data.
The foregoing merely illustrates preferred embodiments of the invention. It will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such modifications and changes are intended to fall within the scope of the invention.

Claims (4)

1. A neighbor model feature selection method for incomplete data in the Dermatology data set, comprising:

Step 1: initializing the feature weight vector $w=[w_1,w_2,\ldots,w_m]\in\mathbb{R}^m$ as an all-ones vector, where $m$ denotes the number of features in the Dermatology data set; the Dermatology data set, constructed for identifying the skin diseases under study, consists of 366 samples and 34 features, of which 8 samples contain missing values; if any of the studied skin diseases has been observed in the subject's family, the 'family history' feature takes the value 1 and otherwise 0; the 'age' feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the largest value, and 1 and 2 indicate relative intermediate values;

Step 2: constructing the filling loss function of the incomplete data in Dermatology based on the given feature weight vector $w$;

Step 3: minimizing the loss function of step 2 with an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data in Dermatology;

Step 4: constructing the loss function of the neighbor-model feature selection method based on the filled data;

Step 5: optimizing the loss function of step 4 by gradient descent, so as to obtain a solution for the feature weight vector;

Step 6: executing steps 3 to 5 cyclically until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached;

Step 7: sorting the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset;
The step 2 comprises:

Based on the known feature weight vector $w$, the filling loss function of the incomplete data in Dermatology is:

$$\min_{G,H}\ \sum_{(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q\big)^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)$$

where $G$ and $H$ are low-rank matrices of rank $r$, $G\in\mathbb{R}^{n\times r}$, $H\in\mathbb{R}^{r\times m}$, $r<\min(n,m)$, and $n$ is the total number of samples in Dermatology; $w_q$, $G_p$ and $H_q$ denote the $q$-th element of $w$, the $p$-th row of $G$, and the $q$-th column of $H$, respectively; $\Omega=\{(p,q)\mid x_{pq}\ \text{is observable}\}$ is the index set of all observable elements of $X$, where $X=[x_1,x_2,\ldots,x_n]^{T}\in\mathbb{R}^{n\times m}$ is the feature matrix containing missing components corresponding to Dermatology and $x_{pq}$ is the element in the $p$-th row and $q$-th column of $X$; $\beta>0$ is a hyper-parameter that needs to be tuned;
The step 4 comprises:

Step 4.1: for any two filled samples $z_p$ and $z_q$, the weighted Euclidean distance $d_{pq}(w)$ between them is defined as

$$d_{pq}(w)=\sqrt{\sum_{l=1}^{m}w_l^{2}\big(z_{pl}-z_{ql}\big)^{2}}$$

Step 4.2: the probability that test sample $z_p$ selects sample $z_q$ as its reference point during prediction is

$$P\big(z_q\mid z_p\big)=\frac{k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)},\qquad q\in S_p\cup D_p$$

where $k(d)=\exp(-d/\sigma)$, $\sigma$ is the kernel width, and $S_p$ and $D_p$ are, respectively, the index set of the $K_1$ nearest-neighbor samples with the same label as $z_p$ and the index set of the $K_2$ nearest-neighbor samples with labels different from that of $z_p$;

Step 4.3: the probability $p_p(w)$ that sample $z_p$ is mispredicted is computed as

$$p_p(w)=\sum_{q\in D_p}P\big(z_q\mid z_p\big)=\frac{\sum_{q\in D_p}k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)}$$

Step 4.4: the classification error is defined as

$$\varepsilon(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)$$

Step 4.5: introducing a regularization term yields the loss function of the neighbor-model feature selection method:

$$L(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)+\lambda\sum_{l=1}^{m}w_l^{2}$$

where $\lambda$ is a positive balance parameter and $w_l^{2}$ represents the weight of the $l$-th feature.
2. The neighbor model feature selection method for incomplete data in the Dermatology data set according to claim 1, wherein said step 3 comprises:

The optimization problem for the filling loss function of the incomplete data in Dermatology can be converted into the following form:

$$\min_{G,H}\ \big\|P_\Omega\big((X-GH)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)\qquad(1)$$

where $\bar{\Omega}$ is the index set of all missing elements of $X$, and the operator $P_\Omega(\cdot)$ is defined element-wise as

$$[P_\Omega(X)]_{pq}=\begin{cases}x_{pq}, & (p,q)\in\Omega\\ 0, & (p,q)\in\bar{\Omega}\end{cases}$$

To solve the optimization problem in equation (1), $G$ is first initialized to a random matrix $G^{(0)}$ with orthogonal columns; then, at the $k$-th iteration, $H^{(k)}$ is computed based on $G=G^{(k-1)}$, and $G^{(k)}$ is computed based on $H=H^{(k)}$, until a stopping criterion is reached;

When $G$ is fixed at $G^{(k-1)}$, the optimization problem in equation (1) is converted into the following form:

$$\min_{H}\ \big\|P_\Omega\big((X-G^{(k-1)}H)\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|H\|_F^{2}$$

which further decomposes into $m$ independent optimization sub-problems, one for each column $H_q$:

$$\min_{H_q}\ \sum_{p:(p,q)\in\Omega} w_q\big(x_{pq}-G_p^{(k-1)}H_q\big)^{2}+\beta\|H_q\|_2^{2},\qquad q=1,\ldots,m$$

Each sub-problem has the analytical solution

$$H_q^{(k)}=\big(w_q\,G_{\Omega_q}^{T}G_{\Omega_q}+\beta I_r\big)^{-1}w_q\,G_{\Omega_q}^{T}x_{\Omega_q}$$

where $G_{\Omega_q}$ stacks the rows $G_p^{(k-1)}$ with $(p,q)\in\Omega$ and $x_{\Omega_q}$ collects the corresponding observed entries of the $q$-th column of $X$, so $H^{(k)}$ can be solved quickly;

When $H$ is fixed at $H^{(k)}$, the optimization problem in equation (1) can be converted into the following form:

$$\min_{G}\ \big\|P_\Omega\big((X-GH^{(k)})\,\mathrm{diag}(w)^{1/2}\big)\big\|_F^{2}+\beta\|G\|_F^{2}$$

which further decomposes into $n$ independent optimization sub-problems, one for each row $G_p$:

$$\min_{G_p}\ \sum_{q:(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q^{(k)}\big)^{2}+\beta\|G_p\|_2^{2},\qquad p=1,\ldots,n$$

Each sub-problem has the analytical solution

$$G_p^{(k)}=x_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}\big(H_{\Omega_p}W_{\Omega_p}H_{\Omega_p}^{T}+\beta I_r\big)^{-1}$$

where $H_{\Omega_p}$ collects the columns $H_q^{(k)}$ with $(p,q)\in\Omega$, $W_{\Omega_p}$ is the diagonal matrix of the corresponding weights $w_q$, $x_{\Omega_p}$ is the row vector of observed entries of the $p$-th row of $X$, and $I_r$ is the $r\times r$ identity matrix;

When the stopping criterion is reached, the filled matrix $Z=G^{(k)}H^{(k)}$ is output.
3. The neighbor model feature selection method for incomplete data in the Dermatology data set according to claim 1, wherein said step 5 comprises:

Step 5.1: computing the partial derivatives of the loss function of step 4 with respect to the vector $w$;

Step 5.2: updating the feature weight vector $w$ by gradient descent;

Step 5.3: repeating steps 5.1 and 5.2 until the absolute difference between the loss values of two adjacent iterations is smaller than a threshold or the maximum number of iterations is reached;

Step 5.4: outputting the feature weight vector $w$.
4. A neighbor model feature selection device for incomplete data in the Dermatology data set, comprising:

an initialization module for initializing the feature weight vector $w=[w_1,w_2,\ldots,w_m]\in\mathbb{R}^m$ as an all-ones vector, where $m$ denotes the number of features in the Dermatology data set; the Dermatology data set, constructed for identifying the skin diseases under study, consists of 366 samples and 34 features, of which 8 samples contain missing values; if any of the studied skin diseases has been observed in the subject's family, the 'family history' feature takes the value 1 and otherwise 0; the 'age' feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the largest value, and 1 and 2 indicate relative intermediate values;

a first construction module for constructing the filling loss function of the incomplete data in Dermatology based on the given feature weight vector $w$;

a filling module for minimizing the loss function of the first construction module with an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data in Dermatology;

a second construction module for constructing the loss function of the neighbor-model feature selection method based on the filled data;

an optimization module for optimizing the loss function of the second construction module by gradient descent, so as to obtain a solution for the feature weight vector;

a loop module for cyclically executing the filling module, the second construction module and the optimization module until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and

a sorting output module for sorting the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset;
The first construction module is specifically configured to:

Based on the known feature weight vector $w$, construct the filling loss function of the incomplete data:

$$\min_{G,H}\ \sum_{(p,q)\in\Omega} w_q\big(x_{pq}-G_pH_q\big)^{2}+\beta\big(\|G\|_F^{2}+\|H\|_F^{2}\big)$$

where $G$ and $H$ are low-rank matrices of rank $r$ (the product of $G$ and $H$ is the feature matrix with the missing components filled in), $G\in\mathbb{R}^{n\times r}$, $H\in\mathbb{R}^{r\times m}$, $r<\min(n,m)$, and $n$ is the total number of samples in the data set containing missing values; $w_q$, $G_p$ and $H_q$ denote the $q$-th element of $w$, the $p$-th row of $G$, and the $q$-th column of $H$, respectively; $\Omega=\{(p,q)\mid x_{pq}\ \text{is observable}\}$ is the index set of all observable elements of $X$, where $X=[x_1,x_2,\ldots,x_n]^{T}\in\mathbb{R}^{n\times m}$ is the feature matrix containing missing components and $x_{pq}$ is the element in the $p$-th row and $q$-th column of $X$; $\beta>0$ is a hyper-parameter that needs to be tuned;
The second construction module is specifically configured to:

Step 4.1: for any two filled samples $z_p$ and $z_q$, the weighted Euclidean distance $d_{pq}(w)$ between them is defined as

$$d_{pq}(w)=\sqrt{\sum_{l=1}^{m}w_l^{2}\big(z_{pl}-z_{ql}\big)^{2}}$$

Step 4.2: the probability that test sample $z_p$ selects sample $z_q$ as its reference point during prediction is

$$P\big(z_q\mid z_p\big)=\frac{k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)},\qquad q\in S_p\cup D_p$$

where $k(d)=\exp(-d/\sigma)$, $\sigma$ is the kernel width, and $S_p$ and $D_p$ are, respectively, the index set of the $K_1$ nearest-neighbor samples with the same label as $z_p$ and the index set of the $K_2$ nearest-neighbor samples with labels different from that of $z_p$;

Step 4.3: the probability $p_p(w)$ that sample $z_p$ is mispredicted is computed as

$$p_p(w)=\sum_{q\in D_p}P\big(z_q\mid z_p\big)=\frac{\sum_{q\in D_p}k\big(d_{pq}(w)\big)}{\sum_{j\in S_p\cup D_p}k\big(d_{pj}(w)\big)}$$

Step 4.4: the classification error is defined as

$$\varepsilon(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)$$

Step 4.5: introducing a regularization term yields the loss function of the neighbor-model feature selection method:

$$L(w)=\frac{1}{n}\sum_{p=1}^{n}p_p(w)+\lambda\sum_{l=1}^{m}w_l^{2}$$

where $\lambda$ is a positive balance parameter and $w_l^{2}$ represents the weight of the $l$-th feature.
CN202110559948.0A, filed 2021-05-21: Neighbor model feature selection method and device for incomplete data. Granted as CN113177608B (Active).


Publications (2)

CN113177608A, published 2021-07-27 (application publication)
CN113177608B, published 2023-09-05 (granted patent)



