CN113761307A - Feature selection method and device - Google Patents

Feature selection method and device

Info

Publication number
CN113761307A
Authority
CN
China
Prior art keywords
vector
features
feature
redundancy
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011351263.9A
Other languages
Chinese (zh)
Inventor
祖辰
杨立军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202011351263.9A priority Critical patent/CN113761307A/en
Publication of CN113761307A publication Critical patent/CN113761307A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature selection method and device, and relates to the technical field of computers. One embodiment of the method comprises: determining features in each sample, and constructing a sample vector based on values of all the features in each sample; processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features. The global redundancy minimized sparse feature selection algorithm GRMS provided in this embodiment minimizes the redundancy of global features, selects features with strong discriminability, and implements correction on sparsely selected features.

Description

Feature selection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a feature selection method and device.
Background
With the development of information technology, global data is growing explosively, and ever more data needs to be stored and transmitted. To handle massive amounts of high-dimensional data, it is necessary to extract the most informative and valuable content from it. Since high-dimensional data contains a large number of features, those features inevitably contain noise; in this case, feature selection becomes an indispensable data mining technique, and reducing the feature dimensionality can improve the performance of subsequent tasks such as classification or clustering.
Current feature selection methods are classified into three types: filtering methods, wrapper methods and embedded methods. In the process of implementing the invention, the inventors found that these methods generally do not consider the redundancy among the selected features, so the selected features are highly correlated with one another, which is detrimental to subsequent tasks such as clustering or classification. Although the mutual-information-based mRMR (Max-Relevance and Min-Redundancy) feature selection method has been proposed to minimize the redundancy between features, mRMR uses a greedy strategy to find features with minimum redundancy, so the selected features do not achieve global minimization of the redundancy information.
Disclosure of Invention
In view of this, embodiments of the present invention provide a feature selection method and apparatus, which can at least solve the problem that the redundancy of global features cannot be minimized and features with strong discriminability cannot be selected in the prior art.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a feature selection method including:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with global redundancy minimization includes:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
Optionally, the processing the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector includes:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
Optionally, the method further includes:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
Optionally, the normalizing the values of a feature in different samples to construct a feature vector of the feature includes:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided a feature selection apparatus including:
the feature determining module is used for determining the features in each sample and constructing a sample vector based on the values of all the features in each sample; wherein the sample is at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by utilizing a global redundancy minimized sparse feature selection algorithm to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and the feature selection module is used for extracting a predetermined number of features with the maximum importance scores from the importance scores of the features to serve as the selected target features.
Optionally, the vector processing module is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
Optionally, the vector processing module is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
Optionally, the system further includes a redundancy matrix constructing module, configured to:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
Optionally, the redundancy matrix constructing module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a feature selection electronic device.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the feature selection methods described above.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing any of the feature selection methods described above.
According to the scheme provided by the invention, embodiments of the invention have the following advantages or beneficial effects: the redundancy matrix characterizes which features are non-redundant; norm regularization is used for data feature selection; and feature selection and global redundancy information minimization are unified into one convex objective optimization function, so that feature selection is realized effectively.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIGS. 1(a) - (c) are schematic diagrams of the three existing types of feature selection methods, namely filtering, wrapper and embedded methods;
FIG. 2 is a schematic diagram of an mRMR feature selection;
FIG. 3 is a schematic flow chart diagram of a feature selection method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a sparse-based global redundancy information minimization method;
FIG. 5 is a schematic diagram of the main blocks of a feature selection apparatus according to an embodiment of the present invention;
FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 7 is a schematic block diagram of a computer system suitable for use with a mobile device or server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Feature selection has been widely used in many different fields, such as selection of disease causing genes in medical research and image segmentation in computer vision. The use of a small number of features with very rich representation capability can not only accelerate the learning process of the model, but also enhance the generalization capability of the model on unknown data.
The current feature selection methods mainly adopt three modes, whose advantages and disadvantages are analyzed as follows:
1) The filtering feature selection method (as shown in fig. 1 (a)) ranks the features by means of the inherent properties of the data and selects the important features according to that ranking. Notably, the filtering approach is classifier-independent, so it is computationally inexpensive but generally performs worse.
2) The wrapper feature selection method (as shown in fig. 1 (b)) is tied to a specific classifier, and can therefore generally achieve better performance, but it is not suitable for large-scale data due to its high computational cost.
3) The learning process of the embedded feature selection method (as shown in fig. 1 (c)) is embedded into the training of the classification model, and the features to be retained (i.e., the selected features) are finally obtained by solving the optimization problem of the classification algorithm.
Therefore, compared with the filtering and wrapper feature selection methods, the embedded feature selection method obtains both a lower computational cost and satisfactory feature selection performance.
Referring to fig. 2, a schematic diagram of mRMR feature selection is shown. There are four different features on the left, with feature importance scores of 0.95, 0.91, 0.72 and 0.33. Feature 1 has the highest importance score and is correlated with feature 2 and feature 3. Features 2 and 3 may each represent the data from a different angle, so they are not correlated with each other. When the mRMR method selects among the above 4 features, since features 2 and 3 are both correlated with feature 1, features 1 and 4 are finally selected and features 2 and 3 are discarded. In fact, the combination of features 2 and 3 has stronger discriminating power than features 1 and 4. The mRMR method cannot minimize the redundancy of the global features and cannot select the features with strong discriminability (here, features 2 and 3).
Referring to fig. 3, a main flowchart of a feature selection method provided by an embodiment of the present invention is shown, including the following steps:
s301: determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
s302: processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
s303: and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
In the above embodiment, regarding steps S301 to S303: since low-dimensional data contains few features, the present solution is applied to feature selection on high-dimensional data (such as voice/image/text), where high-dimensional generally means that the number of features is much larger than the number of samples.
Redundancy means that two or more features provide duplicate information, from which redundant information minimization can be defined: if features i and j are correlated with each other, then preferably one of them is retained and the other discarded during feature selection, thereby minimizing the redundant information of the finally selected features.
Generally, by reducing redundant features and using more informative features, better performance can be achieved in subsequent tasks (e.g., clustering, classification). For example, to distinguish a specific person in a face recognition task, it is desirable to select non-redundant features that are discriminative and representative, such as the eyes, nose and mouth, rather than non-representative redundant features such as the hair. The motivation of the feature selection method with global redundancy minimization proposed by the scheme is to select representative, non-redundant features from the feature set of the samples.
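As an illustration only (not part of the patent text), a minimal Python sketch of steps S301 to S303 might look as follows; the helper grms_scores is a hypothetical stand-in for a GRMS solver, whose objective is described below.

import numpy as np

def select_features(samples, labels, k, grms_scores):
    # S301: one row per sample, one column per feature
    X = np.asarray(samples, dtype=float)
    # S302: importance score vector produced by the (hypothetical) GRMS solver
    s = grms_scores(X, labels)
    # S303: indices of the k features with the largest importance scores
    return np.argsort(s)[::-1][:k]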
Given n samples $\{x_1, x_2, \ldots, x_n\}$, a sample matrix $X \in \mathbb{R}^{n \times d}$ is used to represent the sample vectors, where the k-th row represents the k-th sample $x_k \in \mathbb{R}^d$ and d is the number of features. Taking a face picture as an example, its features can be wavelet features, histogram features, gray-level features and the like extracted from the picture.
The Global Redundancy Minimized Sparse feature selection algorithm (GRMS) provided by the scheme is:

$$\min_{w,\,s}\ \|Xw - y\|_2^2 + \gamma\,\|w\|_{2,1} + \beta\, s^\top A s - \lambda\, w^\top s \qquad \text{s.t.}\ \ s \ge 0,\ \mathbf{1}^\top s = 1$$

Both w and s are obtained by the computation, and the scheme mainly uses s to select features. Here, $w \in \mathbb{R}^d$ is the regression coefficient; it corresponds to the first importance score vector that does not consider feature redundancy, and is also the feature importance weight produced by the sparse feature selection algorithm; s is the second importance score vector, which does take redundancy into account; $\mathbf{1} \in \mathbb{R}^d$ denotes the all-ones vector of size d; and γ, β and λ are regularization parameters used to balance the weights of the terms in the formula. It is worth noting that, to keep the optimization from falling into a trivial solution, the constraints $s \ge 0$ and $\mathbf{1}^\top s = 1$ are imposed on the elements of s (the elements are non-negative and sum to 1).

The first term of GRMS regresses the training samples onto the labels (y, the categories of the samples) through the regression coefficients, and measures the gap between the regressed values and the true values with a squared loss. The second term is a norm of the regression coefficients w that produces a sparsity effect: many rows of the finally solved w are all 0, and the non-zero rows correspond to the retained features, thereby realizing feature selection. The last two terms minimize the global feature redundancy information within the feature selection itself.
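For concreteness, a minimal sketch (an illustration, not the patent's implementation) that evaluates the four terms of the objective as reconstructed above; the function name, its signature and the vector form of w are assumptions.

import numpy as np

def grms_objective(X, y, w, s, A, gamma, beta, lam):
    # first term: squared regression loss ||Xw - y||^2
    fit = np.sum((X @ w - y) ** 2)
    # second term: sparsity-inducing norm of w (for a vector w this reduces to the l1 norm)
    sparsity = np.sum(np.abs(w))
    # third term: global redundancy of the corrected scores, s^T A s
    redundancy = s @ A @ s
    # fourth term: consistency between original and corrected scores, w^T s (to be maximized)
    consistency = w @ s
    return fit + gamma * sparsity + beta * redundancy - lam * consistency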
In the above formula, A is the redundancy matrix: $A \in \mathbb{R}^{d \times d}$ is introduced to describe the redundancy between every two features. Given the sample matrix $X \in \mathbb{R}^{n \times d}$, the i-th and j-th features are $X^{(i)}$ and $X^{(j)}$ ($i, j = 1, 2, \ldots, d$), i.e., the corresponding columns of X. The following centered features can be obtained accordingly:

$$f_i = H X^{(i)}, \qquad f_j = H X^{(j)}$$

where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix, and $f_i$ and $f_j$ represent the feature vectors after the i-th and j-th features have been centered. In actual operation, the centering matrix need not be formed explicitly; the values of a feature across the different samples are simply normalized directly.
The redundancy matrix A may be calculated as follows:

$$A_{ij} = \frac{(f_i^\top f_j)^2}{\|f_i\|_2^2\,\|f_j\|_2^2}$$

It can be seen that the redundancy matrix satisfies $A = B \circ B$, where $B_{ij} = \frac{f_i^\top f_j}{\|f_i\|_2\,\|f_j\|_2}$ and $\circ$ is the Hadamard product. The above can therefore be written in matrix form as $B = D F^\top F D = (FD)^\top (FD)$, where $F = [f_1, f_2, \ldots, f_d]$ and D is a diagonal matrix with diagonal elements $1/\|f_i\|_2$, $i = 1, 2, \ldots, d$. It follows at once that B is a positive semi-definite matrix.

Suppose the i-th and j-th features are highly correlated; then the absolute value of their correlation coefficient is large, whether the correlation is positive or negative. To penalize such high correlations, the squared cosine similarity is used to measure the correlation between two features. Since B is positive semi-definite, the redundancy matrix A is non-negative and, by the Schur product theorem, also positive semi-definite.
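A minimal sketch of this construction, assuming the squared-cosine form derived above (the function name is illustrative):

import numpy as np

def redundancy_matrix(X, eps=1e-12):
    # X: (n, d) data matrix, one row per sample, one column per feature
    F = X - X.mean(axis=0)                       # center each feature: f_i = H X^(i)
    FD = F / (np.linalg.norm(F, axis=0) + eps)   # FD, where D = diag(1 / ||f_i||)
    B = FD.T @ FD                                # B = (FD)^T (FD), positive semi-definite
    return B * B                                 # A = B * B elementwise (Hadamard square)

Under this construction the diagonal of A consists of ones, and a larger off-diagonal entry A_ij indicates a more redundant pair of features i and j.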
Through this sparse feature selection algorithm, the global feature redundancy can be minimized while the weight consistency of the features before and after correction is ensured. Specifically, if feature i is correlated with all the other features, the redundancy term $s^\top A s$ causes the corrected weight score $s_i$ of feature i to become smaller, thus reducing the importance of feature i. On the other hand, if feature i is uncorrelated with all the other features, the corrected weight score $s_i$ of feature i stays consistent with the initial feature weight score $w_i$, i.e., the weight score of feature i remains unchanged.
Taking a face recognition application as an example, suppose the features extracted from the original face data include gray-level features, histogram features, Fourier features, gradient features and the like; the matrix X is composed of these features of the different pictures, and y is the category vector corresponding to the different persons. The redundancy matrix A is calculated from the data matrix X; then the matrix X, the category vector y and the redundancy matrix A are input into the GRMS formula to finally obtain the corrected weight score vector s of the different features, where the larger an element of s, the higher the weight of the corresponding feature.
Referring to the redundancy matrix shown in fig. 4, the darker the color, the stronger the correlation (i.e., redundancy) between the features. It is worth noting that feature selection and feature redundancy minimization are optimized and solved simultaneously.
The method can also be considered from another angle. First, on the basis of the sample vectors, the first importance score of each feature in the feature set, without considering redundancy, is calculated through an evaluation criterion such as an extremely randomized trees (Extra-Trees) algorithm or a random forest, and a first importance score vector $w \in \mathbb{R}^d$ is constructed. Then the first importance score vector is input into the redundant information minimization criterion to perform redundancy-removal correction on the different features, obtaining the second importance score vector s, which includes the second importance score of each feature. The criterion for minimizing redundant information is:

$$\min_s\ s^\top A s - \lambda\, w^\top s \qquad \text{s.t.}\ \ s \ge 0,\ \mathbf{1}^\top s = 1$$

where λ is a regularization parameter used to balance the weights of the first term and the second term, and the redundancy matrix A characterizes the correlation, i.e., the redundancy, between every two features.

Specifically, the first term $s^\top A s$ characterizes the redundancy of the global features, so the smaller the first term, the better. The second term $w^\top s$ characterizes the consistency between the first importance scores w of the original features (redundancy not considered) and the second importance scores s of the corrected features (redundancy considered); since it is desirable that the corrected scores stay as consistent as possible with the feature importance before correction, the second term should be as large as possible.
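The patent does not specify how this criterion is solved; one possibility is a projected-gradient sketch over the probability simplex, as below (the function names, step size and iteration count are assumptions):

import numpy as np

def project_simplex(v):
    # Euclidean projection onto {s : s >= 0, sum(s) = 1} via the standard sorting method
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def grm_correct(w, A, lam=1.0, lr=0.01, n_iter=500):
    # sketch of: min_s s^T A s - lam * w^T s   s.t.  s >= 0, 1^T s = 1
    s = np.full(len(w), 1.0 / len(w))        # start from the uniform score vector
    for _ in range(n_iter):
        grad = 2.0 * A @ s - lam * w         # gradient of the criterion
        s = project_simplex(s - lr * grad)   # gradient step, then project onto the simplex
    return s

Combined with the redundancy_matrix sketch above, s = grm_correct(w, redundancy_matrix(X)) would yield corrected scores whose largest entries indicate the features to retain.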
In the method provided by the above embodiment, the proposed global redundancy minimized sparse feature selection algorithm GRMS ranks the features by importance after calculating the second importance score vector, and at the same time uses the redundancy matrix to characterize which features are non-redundant, thereby correcting the sparsely selected features.
Aiming at the problem that existing methods cannot select a feature subset with globally minimized redundant information, the method provided by the scheme proposes a global-redundancy-minimized sparse feature selection algorithm, so that the features are redundancy-corrected based on the redundancy matrix, and feature selection and feature redundancy minimization are optimized and solved simultaneously.
Referring to fig. 5, a schematic diagram of main modules of a feature selection apparatus 500 according to an embodiment of the present invention is shown, including:
a feature determining module 501, configured to determine features in each sample, and construct a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
a vector processing module 502, configured to process the sample vector by using a sparse feature selection algorithm with minimized global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
a feature selection module 503, configured to extract a predetermined number of features with the largest importance scores from the importance scores of the features, so as to serve as the selected target features.
In the device for implementing the present invention, the vector processing module 502 is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
In the device for implementing the present invention, the vector processing module 502 is configured to:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
The implementation device of the invention also comprises a redundancy matrix construction module used for:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
In the implementation apparatus of the present invention, the redundancy matrix constructing module is configured to:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
In addition, since the detailed implementation of the device according to the embodiment of the present invention has already been described in detail in the above method, it is not repeated here.
FIG. 6 illustrates an exemplary system architecture 600 to which embodiments of the invention may be applied.
As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605 (by way of example only). The network 604 serves to provide a medium for communication links between the terminal devices 601, 602, 603 and the server 605. Network 604 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 601, 602, 603 to interact with the server 605 via the network 604 to receive or send messages or the like. Various communication client applications can be installed on the terminal devices 601, 602, 603.
The terminal devices 601, 602, 603 may be various electronic devices having display screens and supporting web browsing, and the server 605 may be a server providing various services.
It should be noted that the method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the apparatus is generally disposed in the server 605.
It should be understood that the number of terminal devices, networks, and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 701.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a feature determination module, a vector processing module, and a feature selection module. The names of these modules do not in some cases form a limitation on the module itself, and for example, a vector processing module may also be described as a "module that processes a sample vector to obtain an importance score vector".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
According to the technical scheme of the embodiment of the invention, a sparse feature selection algorithm based on global redundancy minimization is provided, redundancy correction is carried out on features based on a redundancy matrix, and feature selection and feature redundancy minimization are simultaneously optimized and solved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of feature selection, comprising:
determining features in each sample, and constructing a sample vector based on values of all the features in each sample; wherein, the sample is at least one of image, text and voice data;
processing the sample vector by using a sparse feature selection algorithm with minimum global redundancy to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and extracting a predetermined number of features with the largest importance scores from the importance scores of the features to serve as the selected target features.
2. The method of claim 1, wherein the processing the sample vector using a sparse feature selection algorithm with global redundancy minimization comprises:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
3. The method of claim 1, wherein the processing the sample vector using a sparse feature selection algorithm with minimized global redundancy to obtain a significance score vector comprises:
processing the sample vector through an evaluation criterion to calculate, for each feature, a first importance score when redundancy is not considered, and generating a first importance score vector;
and introducing a redundancy matrix, and inputting the first importance score vector into a redundancy information minimization criterion to perform redundancy-removal correction on the different features to obtain a second importance score vector.
4. The method of claim 2 or 3, further comprising:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
5. The method of claim 4, wherein normalizing values of a feature in different samples to construct a feature vector of the feature comprises:
constructing a data matrix according to the values of a feature in different samples;
and processing the data matrix by using a centralization matrix to obtain a centralized feature vector corresponding to the feature.
6. A feature selection apparatus, comprising:
the characteristic determining module is used for determining the characteristics in each sample and constructing a sample vector based on the values of all the characteristics in each sample; wherein, the sample is at least one of image, text and voice data;
the vector processing module is used for processing the sample vector by utilizing a global redundancy minimized sparse feature selection algorithm to obtain an importance score vector; wherein the importance score vector comprises the importance scores of the features;
and the feature selection module is used for extracting a predetermined number of features with the maximum importance scores from the importance scores of the features to serve as the selected target features.
7. The apparatus of claim 6, wherein the vector processing module is configured to:
determining the category of each sample, and inputting the sample vector, the category, the first importance score vector, the second importance score vector and the redundancy matrix into a global redundancy minimized sparse feature selection algorithm for minimization; wherein the first importance score vector does not consider redundancy and the second importance score vector does consider redundancy; and
constraining elements in the second importance score vector; wherein the constraint is that the elements are non-negative and the sum is 1;
a second importance score vector is derived after the elements are minimized and constrained.
8. The apparatus of claim 7, further comprising a redundancy matrix construction module to:
normalizing values of a feature in different samples to construct a feature vector of the feature;
and calculating the inner product between the feature vectors of every two features to construct the redundancy matrix between every two features.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202011351263.9A 2020-11-25 2020-11-25 Feature selection method and device Pending CN113761307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011351263.9A CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Publications (1)

Publication Number Publication Date
CN113761307A true CN113761307A (en) 2021-12-07

Family

ID=78786083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011351263.9A Pending CN113761307A (en) 2020-11-25 2020-11-25 Feature selection method and device

Country Status (1)

Country Link
CN (1) CN113761307A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2213523A1 (en) * 1995-03-13 1996-10-31 Metra Biosystems, Inc. Apparatus and method for acoustic analysis of bone using optimized functions of spectral and temporal signal components
US20140207711A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Transductive feature selection with maximum-relevancy and minimum-redundancy criteria
US20140207799A1 (en) * 2013-01-21 2014-07-24 International Business Machines Corporation Hill-climbing feature selection with max-relevancy and minimum redundancy criteria
CN104933733A (en) * 2015-06-12 2015-09-23 西北工业大学 Target tracking method based on sparse feature selection
CN111338950A (en) * 2020-02-25 2020-06-26 北京高质系统科技有限公司 Software defect feature selection method based on spectral clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEIPING NIE et al.: "A General Framework for Auto-Weighted Feature Selection via Global Redundancy Minimization", IEEE Transactions on Image Processing, vol. 28, no. 5, 31 May 2019 (2019-05-31), pages 2429, XP011709576, DOI: 10.1109/TIP.2018.2886761 *
WANG; WEI Jinmao; ZHANG Lu: "Multi-task feature learning algorithm based on preserving classification information" (基于保留分类信息的多任务特征学习算法), Journal of Computer Research and Development (计算机研究与发展), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
US11804069B2 (en) Image clustering method and apparatus, and storage medium
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN107145485B (en) Method and apparatus for compressing topic models
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
WO2021135449A1 (en) Deep reinforcement learning-based data classification method, apparatus, device, and medium
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN112863683A (en) Medical record quality control method and device based on artificial intelligence, computer equipment and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN112188306A (en) Label generation method, device, equipment and storage medium
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN114298997B (en) Fake picture detection method, fake picture detection device and storage medium
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN112995414B (en) Behavior quality inspection method, device, equipment and storage medium based on voice call
CN114329004A (en) Digital fingerprint generation method, digital fingerprint generation device, data push method, data push device and storage medium
WO2021012691A1 (en) Method and device for image retrieval
CN113569018A (en) Question and answer pair mining method and device
CN112784189A (en) Method and device for identifying page image
CN110209895A (en) Vector index method, apparatus and equipment
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN115392361A (en) Intelligent sorting method and device, computer equipment and storage medium
CN113761307A (en) Feature selection method and device
CN113821687A (en) Content retrieval method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination