CN113177608B - Neighbor model feature selection method and device for incomplete data - Google Patents
- Publication number
- CN113177608B (application number CN202110559948.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- loss function
- incomplete data
- samples
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a neighbor model feature selection method and device for incomplete data, wherein the method comprises the following steps: step 1, initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector; step 2, constructing a filling loss function of incomplete data based on the given feature weight vector w; step 3, minimizing the loss function of step 2 by an alternating iterative optimization algorithm; step 4, constructing the loss function of the neighbor model feature selection method based on the filled data; step 5, optimizing the loss function of step 4 by a gradient descent method; step 6, repeating steps 3 to 5 until the change in the length of the feature weight vector is smaller than a threshold or the maximum number of iterations is reached; and step 7, sorting the features in descending order according to the finally output feature weight vector, thereby selecting an optimal feature subset. The method and device take the importance of the features into account when calculating the filling loss, and can therefore effectively improve the classification accuracy of feature selection for incomplete data.
Description
Technical Field
The invention relates to the technical field of feature selection, and in particular to a neighbor model feature selection method and device for incomplete data.
Background
Feature selection is an effective data preprocessing technique that is widely used in fields such as image analysis, text analysis and bioinformatics. In practice, many datasets contain missing values because of machine faults or hardware limitations of the acquisition equipment; that is, high-dimensional incomplete datasets arise in many fields. A high-dimensional incomplete dataset not only increases the time and space costs of a model but also reduces the model's classification accuracy. Although many feature selection algorithms exist today, most are designed for complete data [Tran, C.T., et al., "Improving performance of classification on incomplete data using feature selection and clustering," Applied Soft Computing, 2018, 73: 848-861].
Currently, feature selection and data filling (imputation) are generally treated as two separate processes [Zhao, Y. and Q. Long, "Variable selection in the presence of missing data: imputation-based methods," Wiley Interdisciplinary Reviews: Computational Statistics, 2017, 9(5): e1402]; that is, the incomplete data are first filled based on mean interpolation or some other criterion, and feature selection is then performed on the filled data. Because the importance of the features is not considered during data filling, this kind of approach typically overlooks some important features.
Disclosure of Invention
To address the problem that existing feature selection methods for incomplete data generally treat feature selection and data filling as two independent processes, namely that the importance of the features is not considered during data filling and some important features are therefore ignored, the invention provides a neighbor model feature selection method and device for incomplete data.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention provides a neighbor model feature selection method aiming at incomplete data, which comprises the following steps:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: sorting the features in descending order according to the finally output feature weight vector, so as to select an optimal feature subset.
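Steps 1 to 7 above can be sketched as a single alternating loop. The sketch below is a hypothetical skeleton, not the patented algorithm: to keep it self-contained, the weighted low-rank fill of steps 2-3 is replaced by a plain column-mean fill and the gradient-descent weight update of steps 4-5 by a ReliefF-style proxy, both labeled as stand-ins; only the outer structure (all-ones initialization, alternate fill and update, stop on small change in the length of w, rank features by weight) follows the text.

```python
import numpy as np

def fill_missing(X, w):
    """Stand-in for steps 2-3: a plain column-mean fill.
    The patent instead minimizes a w-weighted low-rank filling loss;
    mean fill is used here only to keep the sketch self-contained."""
    Z = X.copy()
    col_mean = np.nanmean(Z, axis=0)
    idx = np.where(np.isnan(Z))
    Z[idx] = np.take(col_mean, idx[1])
    return Z

def update_weights(Z, y, w):
    """Stand-in for steps 4-5: grow the weight of features that separate
    each sample from its nearest different-class neighbor more than from
    its nearest same-class neighbor (a ReliefF-style proxy)."""
    n, m = Z.shape
    delta = np.zeros(m)
    for p in range(n):
        d = np.sqrt((((Z - Z[p]) * w) ** 2).sum(axis=1))
        d[p] = np.inf                      # never pick the sample itself
        same = np.where(y == y[p])[0]
        diff = np.where(y != y[p])[0]
        hit = same[np.argmin(d[same])]     # nearest same-class sample
        miss = diff[np.argmin(d[diff])]    # nearest different-class sample
        delta += np.abs(Z[p] - Z[miss]) - np.abs(Z[p] - Z[hit])
    return np.clip(w + delta / n, 0.0, None)

def neighbor_feature_selection(X, y, max_iter=20, tol=1e-4):
    """Outer loop of steps 1-7 for an (n, m) matrix X with np.nan
    marking missing entries and label vector y."""
    w = np.ones(X.shape[1])                    # step 1: all-ones vector
    for _ in range(max_iter):                  # step 6: outer loop
        Z = fill_missing(X, w)                 # steps 2-3
        w_new = update_weights(Z, y, w)        # steps 4-5
        done = abs(np.linalg.norm(w_new) - np.linalg.norm(w)) < tol
        w = w_new
        if done:                               # step 6: small length change
            break
    return w, np.argsort(-w)                   # step 7: descending ranking
```

The two stand-in functions can be swapped for the weighted low-rank fill and the gradient-descent update described in the following sections without touching the loop.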
Further, the step 2 includes:
based on the known feature weight vector w, the filling loss function of incomplete data is:
where G and H are low-rank matrices of rank r (the product GH is the feature matrix with the missing components filled in), G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter that needs to be tuned.
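The equation image itself did not survive extraction. From the symbols defined in the surrounding text (w_q, G_p, H_q, Ω, β and the ridge-type closed forms described in step 3), a plausible reconstruction of the filling loss is the following; the exact placement of the weight w_q (squared versus unsquared) is an assumption:

```latex
\min_{G,H}\;\sum_{(p,q)\in\Omega} w_q^{2}\,\bigl(x_{pq} - G_p H_q\bigr)^{2}
\;+\;\beta\left(\lVert G\rVert_F^{2} + \lVert H\rVert_F^{2}\right)
\tag{1}
```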
Further, the step 3 includes:
the optimization problem for the filling loss function of incomplete data can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
it further decomposes into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
it further decomposes into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
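The alternating scheme of step 3 can be sketched in code. The sketch below is a hedged reconstruction, not the patented implementation: it assumes the filling loss has the form sum over (p,q) in Ω of w_q^2 (x_pq - G_p H_q)^2 plus β(||G||_F^2 + ||H||_F^2), and solves each column sub-problem for H and each row sub-problem for G by its ridge-regression closed form, which matches the r × r identity matrix I_r mentioned in the text.

```python
import numpy as np

def weighted_lowrank_fill(X, w, r=2, beta=1e-3, iters=100):
    """Alternating minimization of the assumed loss
    sum_{(p,q) in Omega} w_q^2 (x_pq - G_p H_q)^2 + beta(||G||_F^2 + ||H||_F^2).

    X : (n, m) array with np.nan marking missing entries.
    w : (m,) feature weight vector.
    Returns the filled matrix Z = G H."""
    n, m = X.shape
    obs = ~np.isnan(X)                                 # Omega as a boolean mask
    rng = np.random.default_rng(0)
    G = np.linalg.qr(rng.standard_normal((n, r)))[0]   # G^(0), orthogonal columns
    H = np.zeros((r, m))
    for _ in range(iters):
        # m independent column sub-problems for H (G fixed): ridge regression
        for q in range(m):
            A = G[obs[:, q]]
            wq2 = w[q] ** 2
            H[:, q] = np.linalg.solve(wq2 * A.T @ A + beta * np.eye(r),
                                      wq2 * (A.T @ X[obs[:, q], q]))
        # n independent row sub-problems for G (H fixed): weighted ridge
        for p in range(n):
            B = (H[:, obs[p]] * w[obs[p]]).T           # weight observed columns
            c = X[p, obs[p]] * w[obs[p]]
            G[p] = np.linalg.solve(B.T @ B + beta * np.eye(r), B.T @ c)
    return G @ H
```

In the patent the number of sweeps is governed by a stopping criterion; a fixed `iters` is used here for simplicity.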
Further, the step 4 includes:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature.
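Steps 4.1-4.3 can be sketched numerically. The two formulas below are assumptions filling in the missing equation images: the weighted distance is taken as d_pq(w) = sqrt(sum over l of w_l^2 (z_pl - z_ql)^2), and p_p(w) as the share of kernel mass k(d) = exp(-d/σ) that falls on the different-label neighbor set D_p.

```python
import numpy as np

def weighted_dist(zp, zq, w):
    """Assumed step 4.1: w-weighted Euclidean distance between two samples."""
    return np.sqrt(np.sum(w ** 2 * (zp - zq) ** 2))

def misclassification_prob(Z, y, w, p, k1=2, k2=2, sigma=1.0):
    """Assumed steps 4.2-4.3 for sample p: with kernel k(d) = exp(-d/sigma),
    S_p holds the k1 nearest same-label neighbors and D_p the k2 nearest
    different-label neighbors; p_p(w) is the kernel mass landing on D_p."""
    d = np.array([weighted_dist(Z[p], Z[q], w) for q in range(len(Z))])
    d[p] = np.inf                        # a sample is not its own neighbor
    same = np.where(y == y[p])[0]
    diff = np.where(y != y[p])[0]
    S = same[np.argsort(d[same])][:k1]   # S_p
    D = diff[np.argsort(d[diff])][:k2]   # D_p
    k = np.exp(-d / sigma)
    return k[D].sum() / (k[S].sum() + k[D].sum())
```

Summing p_p(w) over all samples gives the classification error of step 4.4, to which the regularization term of step 4.5 is added.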
Further, the step 5 includes:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
The invention also provides a neighbor model feature selection device for incomplete data, comprising:
an initialization module for initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
a first construction module for constructing a filling loss function of incomplete data based on a given feature weight vector w;
a filling module for minimizing a loss function in the first building module using an alternating iterative optimization algorithm to fill missing components in incomplete data;
the second construction module is used for constructing a loss function of the neighbor model feature selection method based on the filled data;
the optimizing module is used for optimizing the loss function in the second building module by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
the loop module is used for loop executing the filling module, the second construction module and the optimizing module until the length change of the characteristic weight vector is smaller than a threshold value or the maximum iteration number is reached;
and the sorting output module is used for sorting the features in a descending order according to the finally output feature weighting vector so as to select the optimal feature subset.
Compared with the prior art, the invention has the beneficial effects that:
according to the method and the device for selecting the characteristics of the neighbor model aiming at the incomplete data, the characteristic weighting vector obtained by the method for selecting the characteristics of the neighbor model is used for calculating the filling loss, namely, the importance of the characteristics is considered when the filling loss is calculated, so that the classification precision of the characteristic selection aiming at the incomplete data can be effectively improved.
Drawings
Fig. 1 is a basic flowchart of a method for selecting a feature of a neighbor model for incomplete data according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
as shown in fig. 1, a method for selecting a feature of a neighbor model for incomplete data includes:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
step 2: constructing a filling loss function of incomplete data based on a given feature weight vector w;
step 3: minimizing the loss function in the step 2 by adopting an alternate iterative optimization algorithm so as to fill missing components in incomplete data;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: sorting the features in descending order according to the finally output feature weight vector, so as to select an optimal feature subset.
As an embodiment, the dataset containing missing values may be expressed as T = (X, y), where y = [y_1, y_2, ..., y_n]^T ∈ R^n is the label vector, X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components, and n is the total number of samples in the dataset. For the feature matrix X, let Ω = {(p, q) | x_pq is observable} be the index set of all observable elements of X and Ω̄ the index set of all missing elements of X, where x_pq denotes the element in row p, column q of X. In addition, let Z ∈ R^(n×m) be the numerical matrix of rank r (r < min(n, m)) obtained after filling X; this matrix can be further decomposed into the product Z = GH of two low-rank matrices, where G ∈ R^(n×r) and H ∈ R^(r×m).
Further, the step 2 includes:
based on the known feature weight vector w, the filling loss function of incomplete data is:
where G and H are low-rank matrices of rank r (the product GH is the feature matrix with the missing components filled in), G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter that needs to be tuned.
Further, the step 3 includes:
the optimization problem of step 3 can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
in particular, it can be further decomposed into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
in particular, it can be further decomposed into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
Further, the step 4 includes:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: based on an approximate leave-one-out scheme for the neighbor model, the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature. It should be noted that representing the weight of the l-th feature by w_l^2 ensures that the feature weights are non-negative. In particular, when w_l^2 is used in place of |w_l| to represent the feature weights, the introduced regularization term is essentially an L_1 regularization term with respect to the weights. Therefore, a sparse solution of the feature weights can be obtained by optimizing this loss function.
Further, the step 5 includes:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
Further, in the step 6:
the calculation formula of the length of the feature weight vector w is:
to verify the effect of the invention, the following experiments were performed:
common padding schemes for missing data sets (data sets containing missing values) are mean interpolation (AI), KNNI and zero padding (ZI). The above three filling schemes are combined with the feature selection method ReliefF to form three feature selection methods for processing missing data sets. In order to verify the effectiveness of the method, a comparison experiment is performed on 6 missing data sets between the feature selection method and the feature selection method for processing the incomplete data sets. Specifically, the 6 missing data sets are respectively: dermotology, wisconsin, mammagraphic, bands, housevelots, hepatics; in particular, the 6 missing data sets may be selected from the web sitehttps://sci2s.ugr.es/keel/ datasets.phpAnd (5) uploading and downloading.
The dermatology dataset consists of 366 samples and 34 features, with 6 sample classes and 8 samples containing missing values. The dataset was constructed for identifying related skin diseases, and its features take values as follows: if any of the studied skin diseases has been observed in the patient's family, the "family history" feature takes value 1, and otherwise 0. The "age" feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the largest possible degree, and 1 and 2 indicate relative intermediate values.
The wisconsin dataset comprises 699 samples and 9 features, with 2 sample classes and 16 samples containing missing values. The dataset contains cases from a study of breast cancer surgery patients; the task is to determine whether a detected tumor is benign or malignant.
The mammographic dataset comprises 961 samples and 5 features, with 2 sample classes and 131 samples containing missing values. The dataset uses attributes such as the patient's age and the BI-RADS assessment to predict the severity (benign or malignant) of a breast mass lesion.
The bands dataset comprises 539 samples and 19 features, with 2 sample classes, of which 174 samples contain missing values. It poses a classification problem from rotogravure (intaglio) printing, the goal being to determine whether banding occurs on a given cylinder.
The housevotes dataset comprises 435 samples and 16 features, with 2 sample classes and 203 samples containing missing values; it is a dataset concerning congressional voting records.
The hepatitis dataset comprises 155 samples and 19 features, with 2 sample classes and 75 samples containing missing values. It concerns hepatitis, and the aim is to determine whether a patient with the disease survives or dies.
Specific information for the 6 missing data sets described above is shown in table 1.
Table 1: details of the 6 missing datasets
In the experiments of the invention, 70% of each dataset is randomly selected as the training set and the remaining 30% as the test set. The classification accuracy of each method is computed with a KNN classifier. The above procedure is repeated 10 times and the average accuracy of each feature selection method is calculated. The resulting average accuracy and standard deviation of the four feature selection methods are shown in Table 2, where bold type indicates the best value in each row.
Table 2: average accuracy and standard deviation of the four feature selection methods
As can be seen from table 2, the feature selection method of the present invention achieves the highest average classification accuracy among the above 6 missing data sets.
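The evaluation protocol described above (random 70/30 split, KNN accuracy on the test part, averaged over 10 repetitions) can be sketched as follows; the plain NumPy majority-vote KNN and the function name `knn_accuracy` are stand-ins for whatever implementation was actually used in the experiments.

```python
import numpy as np

def knn_accuracy(Z, y, test_frac=0.3, k=3, repeats=10, seed=0):
    """Repeat a random split `repeats` times and return the mean accuracy
    of a k-nearest-neighbor classifier on the held-out fraction."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accs = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_test = int(n * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        correct = 0
        for p in test:
            d = np.linalg.norm(Z[train] - Z[p], axis=1)
            nn = train[np.argsort(d)[:k]]
            # majority vote among the k nearest training samples
            vals, counts = np.unique(y[nn], return_counts=True)
            correct += vals[np.argmax(counts)] == y[p]
        accs.append(correct / n_test)
    return float(np.mean(accs))
```

Running this on the filled-and-reduced feature matrix of each dataset reproduces the shape of the protocol, with accuracy averaged over the 10 random splits.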
On the basis of the above embodiment, another aspect of the present invention further provides a neighbor model feature selection device for incomplete data, including:
an initialization module for initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dataset containing missing values;
a first construction module for constructing a filling loss function of incomplete data based on a given feature weight vector w;
a filling module for minimizing a loss function in the first building module using an alternating iterative optimization algorithm to fill missing components in incomplete data;
the second construction module is used for constructing a loss function of the neighbor model feature selection method based on the filled data;
the optimizing module is used for optimizing the loss function in the second building module by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
the loop module is used for loop executing the filling module, the second construction module and the optimizing module until the length change of the characteristic weight vector is smaller than a threshold value or the maximum iteration number is reached;
and the sorting output module is used for sorting the features in a descending order according to the finally output feature weighting vector so as to select the optimal feature subset.
Further, the first construction module is specifically configured to:
based on the known feature weight vector w, the padding loss function of incomplete data is:
wherein G, H is a low-rank matrix with rank R (the product of G and H is the characteristic matrix filled with missing components), G∈R n×r ,H∈R r×m R < min (n, m), n being the total number of samples in the dataset containing missing values; w (w) q 、G p and Hq Respectively representing the q-th element of w, the p-th row of G and the q-th column of H; Ω= { (p, q) |x pq Is observable } is the index set of all observable elements in X, x= [ X ] 1 ,x 2 ,...,x n ] T ∈R n×m Is a feature matrix containing missing components, x pq Representing elements on the p-th row, q-th column of matrix X; beta>0 is a hyper-parameter that needs to be adjusted.
Further, the filling module is specifically configured to:
the optimization problem of the loss function in the first construction module can be converted into the following optimization problem:
where the matrix Ω̄ (the complement of Ω) is the index set of all missing elements of X, and the function P_Ω(X) is defined as follows:
to solve the optimization problem in equation (1), G is first initialized to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, H^(k) is computed with G = G^(k-1) fixed, and G^(k) is computed with H = H^(k) fixed, until a stopping criterion is reached;
when G is fixed to G^(k-1), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the q-th column of the corresponding matrix;
it further decomposes into m independent optimization sub-problems:
each sub-problem has a closed-form solution, so H^(k) can be computed quickly;
when H is fixed to H^(k), the optimization problem in equation (1) takes the following form:
where the subscripted quantity denotes the p-th row of the corresponding matrix;
it further decomposes into n independent optimization sub-problems:
each sub-problem has a closed-form solution: where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
Further, the second construction module is specifically configured to:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:
step 4.2: the probability that test sample z_p selects sample z_q as a reference point during prediction is:
where k(d) = exp(-d/σ) and σ is the kernel width; S_p and D_p are, respectively, the index set of the K_1 nearest-neighbor samples sharing the label of z_p and the index set of the K_2 nearest-neighbor samples whose labels differ from that of z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is computed as:
step 4.4: the classification error is defined as:
step 4.5: a regularization term is introduced to obtain the loss function of the neighbor model feature selection method:
where λ is a positive balance parameter and w_l^2 represents the weight of the l-th feature.
Further, the optimization module is specifically configured to:
step 5.1, computing the partial derivative of the loss function in step 4 with respect to the vector w;
step 5.2, updating the characteristic weight vector w by using a gradient descent method;
step 5.3, repeating steps 5.1 and 5.2 until the absolute difference of the loss function values in two adjacent iterations is less than a threshold value or the maximum number of iterations is reached;
and 5.4, outputting a characteristic weighting vector w.
In summary, the feature weight vector obtained by the neighbor model feature selection method is used in the calculation of the filling loss; that is, the importance of the features is considered when the filling loss is calculated, so the classification accuracy of feature selection for incomplete data can be effectively improved.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.
Claims (4)
1. A method of neighbor model feature selection for incomplete data in a dermatology dataset, comprising:
step 1: initializing a feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m as the all-ones vector, where m represents the number of features in the dermatology dataset; the dermatology dataset, constructed for identifying related skin diseases, consists of 366 samples and 34 features, of which 8 samples contain missing values; if any of the studied skin diseases has been observed in the subject's family, the family-history feature takes value 1, and otherwise 0; the age feature records the age of the patient; the remaining features take values in the range 0 to 3, where 0 indicates that the feature is absent, 3 indicates the maximum value, and 1 and 2 indicate relative intermediate values;
step 2: constructing a filling loss function of the incomplete data in the dermatology dataset based on a given feature weight vector w;
step 3: minimizing the loss function in step 2 by an alternating iterative optimization algorithm so as to fill the missing components of the incomplete data in the dermatology dataset;
step 4: constructing a loss function of a neighbor model feature selection method based on the filled data;
step 5: optimizing the loss function in the step 4 by adopting a gradient descent method so as to obtain a solution of the characteristic weight vector;
step 6: circularly executing the steps 3 to 5 until the length change of the feature weight vector is smaller than a threshold value or the maximum iteration number is reached;
step 7: the features are ordered in descending order according to the finally output feature weighting vector, so that an optimal feature subset is selected;
the step 2 comprises the following steps:
based on the known feature weight vector w, the filling loss function of the incomplete data in the dermatology dataset is:
where G and H are low-rank matrices of rank r, G ∈ R^(n×r), H ∈ R^(r×m), r < min(n, m), and n is the total number of samples in the dermatology dataset; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^(n×m) is the feature matrix containing missing components corresponding to the dermatology dataset and x_pq is the element in row p, column q of X; β > 0 is a hyper-parameter to be tuned;
The step 4 comprises the following steps:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:

d_pq(w) = ( Σ_{l=1}^m w_l^2 (z_pl − z_ql)^2 )^{1/2}

step 4.2: the probability that test sample z_p selects sample z_q as its reference point during prediction is:

p_pq(w) = k(d_pq(w)) / Σ_{j ∈ S_p ∪ D_p} k(d_pj(w))

where k(d) = exp(−d/σ), σ is the kernel width, and S_p and D_p are, respectively, the index set of the first K_1 nearest-neighbor samples with the same label as sample z_p and the index set of the first K_2 nearest-neighbor samples with labels different from that of sample z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is calculated as:

p_p(w) = Σ_{q ∈ D_p} p_pq(w)

step 4.4: the classification error is defined as:

ε(w) = Σ_{p=1}^n p_p(w)

step 4.5: introducing a regularization term yields the loss function of the neighbor model feature selection method:

L(w) = ε(w) + λ Σ_{l=1}^m w_l^2

where λ is a positive balance parameter and w_l denotes the weight of the l-th feature.
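Steps 4.1 to 4.5 can be sketched end to end. The equation images are not reproduced on this page, so this code assumes the standard neighborhood-component-style form consistent with the definitions given above (k(d) = exp(−d/σ), S_p and D_p as the K_1 same-label and K_2 different-label nearest neighbors); the function and parameter names are illustrative only.

```python
import numpy as np

def neighbor_loss(Z, y, w, sigma=1.0, K1=3, K2=3, lam=0.1):
    """Neighbor-model feature-selection loss (assumed form): summed
    misprediction probability plus lam * sum_l w_l^2."""
    n, m = Z.shape
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt(np.sum((w ** 2) * diff ** 2, axis=2))    # d_pq(w), step 4.1
    total_err = 0.0
    for p in range(n):
        order = [q for q in np.argsort(D[p]) if q != p]
        S_p = [q for q in order if y[q] == y[p]][:K1]     # same-label neighbors
        D_p = [q for q in order if y[q] != y[p]][:K2]     # different-label neighbors
        ref = S_p + D_p
        kvals = np.exp(-D[p, ref] / sigma)                # k(d) = exp(-d/sigma)
        probs = kvals / kvals.sum()                       # reference probabilities, 4.2
        total_err += probs[len(S_p):].sum()               # p_p(w): mass on D_p, 4.3
    return total_err + lam * np.sum(w ** 2)               # steps 4.4-4.5
```

Informative weights pull same-label neighbors closer, shifting reference probability away from D_p and shrinking the loss, which is what makes the loss usable for ranking features.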
2. The neighbor model feature selection method for incomplete data in a dermatology dataset according to claim 1, wherein said step 3 comprises:
the optimization problem of the filling loss function for incomplete data in the dermatology dataset can be converted into the following optimization problem:

min_{G,H} ||P_Ω((X − GH)W)||_F^2 + β(||G||_F^2 + ||H||_F^2)   (1)

where W = diag(w), Ω̄ denotes the index set of all missing elements of X, and the function P_Ω(X) is defined elementwise by [P_Ω(X)]_pq = x_pq if (p, q) ∈ Ω, and 0 otherwise;
to solve the optimization problem in equation (1), first initialize G to a random matrix G^(0) with orthogonal columns; then, at the k-th iteration, find H^(k) based on G = G^(k−1) and compute G^(k) based on H = H^(k), until a stopping criterion is reached;
when G is fixed as G^(k−1), the optimization problem in equation (1) reduces to an optimization over H alone, which separates by the columns H_q of H;
it is thereby converted into m independent optimization sub-problems, one per column;
each sub-problem is a regularized least-squares problem over the observed entries of the corresponding column of X and therefore has an analytical solution, so H^(k) can be solved quickly;
when H is fixed as H^(k), the optimization problem in equation (1) reduces to an optimization over G alone, which separates by the rows G_p of G;
it is thereby converted into n independent optimization sub-problems, one per row;
each sub-problem likewise has an analytical solution, whose computation involves inverting a regularized r × r matrix, where I_r is the identity matrix of size r × r;
when the stopping criterion is reached, the filled matrix Z = G^(k) H^(k) is output.
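The alternating scheme of claim 2 can be sketched as follows. The closed-form per-column and per-row updates below are an assumed ridge-regression reconstruction of the analytical solutions whose images are not reproduced on this page; the function name and defaults are illustrative.

```python
import numpy as np

def als_fill(X, mask, w, r=2, beta=1e-3, n_iter=50, seed=0):
    """Alternating optimization for the filling problem: each column H_q
    (row G_p) solves a small regularized least-squares problem over the
    observed entries only (assumed closed forms)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    G = np.linalg.qr(rng.standard_normal((n, r)))[0]   # G(0): orthogonal columns
    H = np.zeros((r, m))
    Xz = np.where(mask, X, 0.0)                        # zero out missing entries
    for _ in range(n_iter):
        for q in range(m):                   # fix G, solve for each column H_q
            rows = mask[:, q]
            Gq = G[rows]
            A = (w[q] ** 2) * Gq.T @ Gq + beta * np.eye(r)
            b = (w[q] ** 2) * Gq.T @ Xz[rows, q]
            H[:, q] = np.linalg.solve(A, b)
        for p in range(n):                   # fix H, solve for each row G_p
            cols = mask[p]
            Hw = H[:, cols] * (w[cols] ** 2)  # column weights w_q^2
            A = Hw @ H[:, cols].T + beta * np.eye(r)
            b = Hw @ Xz[p, cols]
            G[p] = np.linalg.solve(A, b)
    return G @ H                              # filled matrix Z = G H
```

Each inner solve is an r × r system (the β I_r term mirrors the identity matrix mentioned in the claim), so one sweep costs far less than factoring the full matrix.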
3. A method for selecting features of a neighbor model for incomplete data in a dermatology dataset according to claim 1, wherein said step 5 comprises:
step 5.1: compute the partial derivative of the loss function of step 4 with respect to the vector w;
step 5.2: update the feature weight vector w using the gradient descent method;
step 5.3: repeat steps 5.1 and 5.2 until the absolute difference between the loss function values of two adjacent iterations is less than a threshold or the maximum number of iterations is reached;
step 5.4: output the feature weight vector w.
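Steps 5.1 to 5.4 can be sketched as a generic loop. The analytical partial derivative of step 5.1 is not reproduced on this page, so a central finite difference stands in for it here; `loss_fn` is the neighbor-model loss viewed as a function of w, and the step size and tolerances are illustrative.

```python
import numpy as np

def gradient_descent_w(loss_fn, w0, lr=0.1, tol=1e-8, max_iter=500, eps=1e-5):
    """Steps 5.1-5.4 as a sketch: finite-difference gradient descent on w."""
    w = np.asarray(w0, dtype=float).copy()
    prev = loss_fn(w)
    for _ in range(max_iter):
        grad = np.array([(loss_fn(w + eps * e) - loss_fn(w - eps * e)) / (2 * eps)
                         for e in np.eye(len(w))])      # step 5.1 (numerical)
        w = w - lr * grad                               # step 5.2: gradient step
        cur = loss_fn(w)
        if abs(cur - prev) < tol:                       # step 5.3: stopping rule
            break
        prev = cur
    return w                                            # step 5.4: output w
```

In practice the analytical gradient is preferred; the finite-difference version costs 2m loss evaluations per iteration but needs no derivation.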
4. A neighbor model feature selection apparatus for incomplete data in a dermatology dataset, comprising:
an initialization module, configured to initialize the feature weight vector w = [w_1, w_2, ..., w_m] ∈ R^m to an all-ones vector, where m represents the number of features in the dermatology dataset; the dermatology dataset, constructed for identifying the skin diseases of interest, consists of 366 samples and 34 features, 8 of which contain missing values; if one of the skin diseases under study has occurred in the family of the subject, the family-history feature takes the value 1, and otherwise 0; the age feature records the age of the patient; the remaining features take values in the range 0-3, where 0 indicates that the feature is absent, 3 indicates the maximum degree, and 1 and 2 indicate relative intermediate values;
a first construction module, configured to construct a filling loss function for the incomplete data in the dermatology dataset based on the given feature weight vector w;
a filling module, configured to minimize the loss function constructed by the first construction module using an alternating iterative optimization algorithm, so as to fill the missing components of the incomplete data in the dermatology dataset;
a second construction module, configured to construct the loss function of the neighbor model feature selection method based on the filled data;
an optimization module, configured to optimize the loss function constructed by the second construction module using a gradient descent method, so as to obtain a solution for the feature weight vector;
a loop module, configured to execute the filling module, the second construction module and the optimization module in a loop until the change in the norm of the feature weight vector falls below a threshold or the maximum number of iterations is reached;
a sorting output module, configured to sort the features in descending order according to the finally output feature weight vector, so as to select the optimal feature subset;
the first construction module is specifically configured to:
based on the known feature weight vector w, the filling loss function for the incomplete data is:

min_{G,H} Σ_{(p,q)∈Ω} w_q^2 (x_pq − G_p H_q)^2 + β(||G||_F^2 + ||H||_F^2)

where G, H are low-rank matrices with rank r (the product of G and H is the feature matrix with its missing components filled), G ∈ R^{n×r}, H ∈ R^{r×m}, r < min(n, m), and n is the total number of samples in the dataset containing missing values; w_q, G_p and H_q denote the q-th element of w, the p-th row of G and the q-th column of H, respectively; Ω = {(p, q) | x_pq is observable} is the index set of all observable elements of X, where X = [x_1, x_2, ..., x_n]^T ∈ R^{n×m} is the feature matrix containing missing components, and x_pq denotes the element in the p-th row and q-th column of X; β > 0 is a hyperparameter to be tuned;
the second construction module is specifically configured to:
step 4.1: for any two filled samples z_p and z_q, the weighted Euclidean distance d_pq(w) between them is defined as:

d_pq(w) = ( Σ_{l=1}^m w_l^2 (z_pl − z_ql)^2 )^{1/2}

step 4.2: the probability that test sample z_p selects sample z_q as its reference point during prediction is:

p_pq(w) = k(d_pq(w)) / Σ_{j ∈ S_p ∪ D_p} k(d_pj(w))

where k(d) = exp(−d/σ), σ is the kernel width, and S_p and D_p are, respectively, the index set of the first K_1 nearest-neighbor samples with the same label as sample z_p and the index set of the first K_2 nearest-neighbor samples with labels different from that of sample z_p;
step 4.3: the probability p_p(w) that sample z_p is mispredicted is calculated as:

p_p(w) = Σ_{q ∈ D_p} p_pq(w)

step 4.4: the classification error is defined as:

ε(w) = Σ_{p=1}^n p_p(w)

step 4.5: introducing a regularization term yields the loss function of the neighbor model feature selection method:

L(w) = ε(w) + λ Σ_{l=1}^m w_l^2

where λ is a positive balance parameter and w_l denotes the weight of the l-th feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110559948.0A CN113177608B (en) | 2021-05-21 | 2021-05-21 | Neighbor model feature selection method and device for incomplete data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177608A CN113177608A (en) | 2021-07-27 |
CN113177608B true CN113177608B (en) | 2023-09-05 |
Family
ID=76929611
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110559948.0A Active CN113177608B (en) | 2021-05-21 | 2021-05-21 | Neighbor model feature selection method and device for incomplete data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177608B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117435906B (en) * | 2023-12-18 | 2024-03-12 | 湖南行必达网联科技有限公司 | New energy automobile configuration feature selection method based on cross entropy |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104091038A (en) * | 2013-04-01 | 2014-10-08 | 太原理工大学 | Method for weighting multiple example studying features based on master space classifying criterion |
CN107193876A (en) * | 2017-04-21 | 2017-09-22 | 美林数据技术股份有限公司 | A kind of missing data complementing method based on arest neighbors KNN algorithms |
CN107220346A (en) * | 2017-05-27 | 2017-09-29 | 荣科科技股份有限公司 | A kind of higher-dimension deficiency of data feature selection approach |
CN107818328A (en) * | 2016-09-14 | 2018-03-20 | 南京航空航天大学 | With reference to the deficiency of data similitude depicting method of local message |
CN108446735A (en) * | 2018-03-06 | 2018-08-24 | 宁波大学 | A kind of feature selection approach optimizing neighbour's constituent analysis based on differential evolution |
CN108765517A (en) * | 2018-04-18 | 2018-11-06 | 天津大学 | A kind of multiple amount vision data fill methods based on convex optimization |
CN110188812A (en) * | 2019-05-24 | 2019-08-30 | 长沙理工大学 | A kind of multicore clustering method of quick processing missing isomeric data |
CN110705762A (en) * | 2019-09-20 | 2020-01-17 | 天津大学 | Ubiquitous power Internet of things perception data missing repairing method based on matrix filling |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9262721B2 (en) * | 2012-11-14 | 2016-02-16 | Repsol, S.A. | Automatically selecting analogous members for new population members based on incomplete descriptions, including an uncertainty characterzing selection |
US10430928B2 (en) * | 2014-10-23 | 2019-10-01 | Cal Poly Corporation | Iterated geometric harmonics for data imputation and reconstruction of missing data |
US20200193220A1 (en) * | 2018-12-18 | 2020-06-18 | National Sun Yat-Sen University | Method for data imputation and classification and system for data imputation and classification |
Non-Patent Citations (1)
Title |
---|
Missing data imputation by K nearest neighbours based on grey relational structure and mutual information; Ruilin Pan, et al.; Applied Intelligence; Vol. 43; pp. 614-632 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |